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Parameter estimation by nonlinear least squares minimization is a common problem that has an 
elegant geometric interpretation: the possible parameter values of a model induce a manifold within 
the space of data predictions. The minimization problem is then to find the point on the manifold 
closest to the experimental data. We show that the model manifolds of a large class of models, 
known as sloppy models, have many universal features; they are characterized by a geometric series of 
widths, extrinsic curvatures, and parameter-effects curvatures, which we describe as a hyper-ribbon. 
A number of common difficulties in optimizing least squares problems are due to this common 
geometric structure. First, algorithms tend to run into the boundaries of the model manifold, 
causing parameters to diverge or become unphysical before they have been optimized. We introduce 
the model graph as an extension of the model manifold to remedy this problem. We argue that 
appropriate priors can remove the boundaries and further improve the convergence rates. We show 
that typical fits will have many evaporated parameters unless the data are very accurately known. 
Second, 'bare' model parameters are usually ill-suited to describing model behavior; cost contours 
in parameter space tend to form hierarchies of plateaus and long narrow canyons. Geometrically, we 
understand this inconvenient parameterization as an extremely skewed coordinate basis and show 
that it induces a large parameter-effects curvature on the manifold. By constructing alternative 
coordinates based on geodesic motion, we show that these long narrow canyons are transformed 
in many cases into a single quadratic, isotropic basin. We interpret the modified Gauss-Newton 
and Levenberg-Marquardt fitting algorithms as an Euler approximation to geodesic motion in these 
natural coordinates on the model manifold and the model graph respectively. By adding a geodesic 
acceleration adjustment to these algorithms, we alleviate the difficulties from parameter-effects 
curvature, improving both efficiency and success rates at finding good fits. 

PACS numbers: 02.60.Ed, 02.40.Ky, 02.60.Pn, 05.10.-a 



I. INTRODUCTION 

An ubiquitous problem in mathematical modeling in- 
volves estimating parameter values from observational 
data. One of the most common approaches to the prob- 
lem is to minimize a sum of squares of the deviations of 
predictions from observations. A typical problem may 
be stated as follows: given a regressor variable, t, sam- 
pled at a set of points {t m } with observed behavior {y m } 
and uncertainty {a m }, what values of the parameters, 9, 
in some model f(t,8), best reproduce or explain the ob- 
served behavior? This optimal value of the parameters 
is known as the best fit. 

To quantify how good a fit is, the standard approach 
is to assume that the data can be reproduced from the 
model plus a stochastic term that accounts for any dis- 
crepancies. That is to say 

where C, m are random variables assumed to be indepen- 
dently distributed according to N(Q,a m ). Written an- 
other way, the residuals given by 

r m (9) = ' /m ~ /(tm ' 0) , (1) 

are random variables that are independently, normally 
distributed with zero mean and unit variance. The prob- 



ability distribution function of the residuals is then 

m9) = (2^ 6XP [~\ I>"^ 2 ) . ( 2 ) 

where M is the number of residuals. The stochastic part 
of the residuals is assumed to enter through its depen- 
dence on the observed data, while the parameter depen- 
dence enters through the model. This distinction implies 
that while the residuals are random variables, the matrix 
of derivatives of the residuals with respect to the param- 
eters is not. We represent this Jacobian matrix by J mfJ- : 

J m ii — I'm ■ 

For a given set of observations {j/m}, the distribution 
in Eq. is a likelihood function, with the most likely, or 
best fit, parameters being those that minimize the cost 
function, C, defined by 

C{9)= l -Y. r ra{9f, (3) 

rn 

which is a sum of squares. Therefore, if the noise is Gaus- 
sian (normally) distributed, minimizing a sum of squares 
is equivalent to a maximum likelihood estimation. 

If the model happens to be linear in the parameters it 
is a linear least squares problem and the best fit values 
of the parameters can be expressed analytically in terms 
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of the observed data and the Jacobian. If, however, the 
model is nonlinear, the best fit cannot be found so eas- 
ily. In fact, finding the best fit of a nonlinear problem 
can be a very difficult task, notwithstanding the many 
algorithms that are designed for this specific purpose. 

For example, a nonlinear least squares problem may 
have many local minima. Any search algorithm that is 
purely local will at best converge to a local minima and 
fail to find the global best fit. The natural solution is to 
employ a search method designed to find a global minima, 
such as a genetic algorithm or simulated annealing. We 
will not address such topics in this paper, although the 
geometric framework that we develop could be applied 
to such methods. We find, surprisingly, that most fitting 
problems do not have many local minima. Instead, we 
find a universality of cost landscapes, as we discuss later 
in section Hm consisting of only one, or perhaps very few, 
minima. 

Instead of difficulties from local minima, the best fit 
of a nonlinear least squares problem is difficult to find 
because of sloppiness, particularly if the model has many 
parameters. Sloppiness is the property that the behavior 
of the model responds very strongly to only a few com- 
binations of parameters, known as stiff parameter com- 
binations, and very weakly to all other combinations of 
parameters, which are known as sloppy parameter combi- 
nations. Although the sloppy model framework has been 
developed in the context of systems biology [l|-|7], models 
from many diverse fields have been shown to lie within 
the sloppy model universality class [8j. 

In this paper we present the geometric framework for 
studying nonlinear least squares models. This approach 
has a long, interesting history, originating with Jeffreys in 
1939 El, a nd later continued by Rao [lQ, [Tl| and many 
others O fl3j . An equivalent, alternative formulation 
began with Beale in 1960 flill. and continued with the 
work of Bates and Watts [15l-fli| and others [H-[2l|. 
The authors have used this geometric approach previ- 
ously to exp lain the extreme difficulty of the data fitting 
process |22| ; of which this work is a continuation. 

In section |n] we present a review of the phenomenon 
of sloppiness and describes the model manifold, i.e. the 
geometric interpretation of a least squares model. The 
geometric picture naturally illustrates two major diffi- 
culties that arise when optimizing sloppy models. First, 
parameters tend to diverge or drift to unphysical values, 
geometrically corresponding to running off the edge of 
the manifold, as we describe in section ITlTl This is a con- 
sequence of the model manifold having boundaries that 
give it the shape of a curving hyper-ribbon in residual 
space with a geometric hierarchy of widths and curva- 
tures. We show, in section Hvl that the model graph, the 
surface formed by plotting the residual output versus the 
parameters, can help to remove the boundaries and im- 
prove the fitting process. Generalizing the model graph 
suggests the use of priors as additional residuals, as we do 
in section [V] We see there that the natural scales of the 
experiment can be a guide to adding priors to the cost 



function that can significantly improve the convergence 
rate. 

The second difficulty is that the model's 'bare' param- 
eters are often a poor coordinate choice for the manifold. 
In section lVTl we construct new coordinates, which we call 
extended geodesic coordinates. The coordinates remove 
the effects of the bad coordinates all the way to the edge 
of the manifold. The degree to which extended geodesic 
coordinates are effective at facilitating optimization is 
related to the curvature of the manifold. Section IVIII 
discusses several measures of curvature and explores cur- 
vature of sloppy models. We show that the parameter- 
effects curvature is typically the dominant curvature of 
a sloppy model, explaining why extended geodesic coor- 
dinates can be a huge simplification to the optimization 
process. We also show that typical best fits will usually 
have many evaporated parameters and then define a new 
measure of curvature, the optimization curvature, that is 
useful for understanding the limitation of iterative algo- 
rithms. 

We apply geodesic motion to numerical algorithms in 
section I Villi where we show that the modified Gauss- 
Newton method and Levenberg-Marquardt method are 
an Euler approximation to geodesic motion. We then 
add a geodesic acceleration correction to the Levenberg- 
Marquardt algorithm and achieve much faster conver- 
gence rates over standard algorithms and more reliability 
at finding good fits. 

II. THE MODEL MANIFOLD 

In this section we review the properties of sloppy mod- 
els and the geometric picture naturally associated with 
least squares models. To provide a concrete example of 
sloppiness to which we can apply the geometric frame- 
work, consider the problem of fitting three monotonically 
decreasing data points to the model 

y (t,6) = e~ t01 +e~ te \ 

Although simple, this model illustrates many of the prop- 
erties of more complicated models. Figure [T^, is an illus- 
tration of the data and several progressively better fits. 
Because of the noise, the best fit does not pass exactly 
through all the data points, although the fit is within the 
errors. 

A common tool to visualize the parameter dependence 
of the cost is to plot contours of constant cost in param- 
eters space, as is done for our toy model in Figure [Tp. 
This view illustrates many properties of sloppy models. 
This particular model is invariant to a permutation of the 
parameters, so the plot is symmetric for reflections about 
the 6*i = 82 line. We refer to the B\ = O2 linear as the 
"fold line" for geometric reasons that will be apparent 
in section IIV1 Around the best fit, cost contours form 
a long narrow canyon. The direction along the length 
of the canyon is a sloppy direction, since this parameter 
combination hardly changes the behavior of the model, 
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FIG. 1: (a) Fitting a nonlinear function to data, in this case the sum of two exponentials to three data points. Fit A 
has rate constants which decay too quickly, resulting in a poor fit; B is an improvement over Fit A, although the rates are too 
slow; the best fit minimizes the cost (the sum of the squares of the residuals, which are deviations of model from data points) 
(b) Contours of constant Cost in parameter space. Note the "plateau" in the region of large rates where the model 
is essentially independent of parameter changes. Note also the long, narrow canyon at lower rates, characteristic of a sloppy 
model. The sloppy direction is parallel to the canyon and the stiff direction is against the canyon wall, (c) Model predictions 
in data space. The experimental data is represented by a single point. The set of all possible fitting parameters induce a 
manifold of predictions in data space. The best fit is the point on the manifold nearest to the data. The plateau in (b) here is 
the small region around the short cusp near the corner. 



and the direction up a canyon wall is the stiff direction. 
Because this model has few parameters, the sloppiness 
is not as dramatic as it is for most sloppy models. It is 
not uncommon for real-life models to have canyons with 
an aspect ratios much more extreme than in Fig. [T}d, 
typically 1000 : 1 or more for models with 10 or more 
parameters [fj. 

Sloppiness can be quantified by considering the 
quadratic approximation of the cost around the best fit. 
The Hessian (second derivative) matrix, H^ v , of the cost 
at the best fit has eigenvalues that span many orders 
of magnitude and whose logarithms tend to be evenly 



spaced, as illustrated in Fig. [5] Eigenvectors of the Hes- 
sian with small eigenvalues are the sloppy directions, 
while those with large eigenvalues are the stiff directions. 
In terms of the residuals, the Hessian is given by 



Hfj, v = d^dyG 

= ^ 9 n r mdvr m + ^ r md^d v r m (4) 
m Ta 

(5) 

m 

= (J T J)„ (6) 



4 



10° 
10" 4 



io 16 | == - 

FIG. 2: Hessian eigenvalues for three sloppy models. Note 
the extraordinarily large range of eigenvalues (15-17 orders of 
magnitude, corresponding to to valley aspect ratios of 10 7 - 
10 9 ) in Fig.[Tp. Notice also the roughly equal fractional spac- 
ing between eigenvalues-there is no clean separation between 
important (stiff) and irrelevant (sloppy) direction in parame- 
ter space, a) The model formed by summing six exponential 
terms with rates and amplitudes. We use this model to in- 
vestigate curvature in section I VIII and as a test problem to 
compare algorithms in section IVIII El b) The linear prob- 
lem of fitting polynomials is sloppy with the Hessian given 
by the Hilbert matrix, c) A more practical model from sys- 
tems biology of signaling the epidermal growth factor in rat 
pheochromocytoma (PC12) cells Q, which also has a sloppy 
eigenvalue spectrum. Many more examples can be found in 
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In the third and fourth line we have made the approx- 
imation that at the best fit the residuals are negligible. 
Although the best fit does not ordinarily corresponds to 
the residuals being exactly zero, the Hessian is usually 
dominated by the term in Eq. ([SJ when evaluated at 
the best fit. Furthermore, the dominant term, J T J, is 
a quantity important geometrically which describes the 
model-parameter response for all values of the parame- 
ters independently of the data. The approximate Hessian 
is useful to study the sloppiness of a model independently 
of the data at points other than the best fit. It also shares 
the sloppy spectrum of the exact Hessian. We call the 
eigenvectors of J T J the local eigenparameters as they 
embody the varying stiff and sloppy combinations of the 
'bare' parameters. 

In addition to the stiff and sloppy parameter com- 
binations near the best fit, Fig. [T|d also illustrates an- 
other property common to sloppy models. Away from 
the best fit the cost function often depends less and less 
strongly on the parameters. The contour plot shows a 
large plateau where the model is insensitive to all pa- 
rameter combinations. Because the plateau occupies a 
large region of parameter space, most initial guesses will 
lie on the plateau. When an initial parameter guess does 
begin on a plateau such as this, even finding the canyon 
can be a daunting task. 

The process of finding the best fit of a sloppy model, 
usually consists of two steps. First, one explores the 
plateau to find the canyon. Second, one follows the 



canyon to the best fit. One will search to find a canyon 
and follow it, only to find a smaller plateau within 
the canyon that must then be searched to find another 
canyon. Qualitatively, the initial parameter guess does 
not fit the data, and the cost gradient does not help much 
to improve the fit. After adjusting the parameters, one 
finds a particular parameter combination that can be ad- 
justed to fit some clump of the data. After optimizing 
this parameter combination (following the canyon), the 
fit has improved but is still not optimal. One must then 
search for another parameter combination that will fit an- 
other aspect of the data, i.e. find another canyon within 
the first. Neither of these steps, searching the plateau or 
following the canyon, is trivial. 

Although plotting contours of constant cost in param- 
eter space can be an useful and informative tool, it is 
not the only way to visualize the data. We now turn to 
describing an alternative geometric picture that helps to 
explain why the the processes of searching plateaus and 
following canyons can be so difficult. The geometric pic- 
ture provides a natural motivation for tools to improve 
the optimization process. 

Since the cost function has the special form of a sum 
of squares, it has the properties of a Euclidean distance. 
We can interpret the residuals as components of an M- 
dimensional residual vector. The M-dimensional space 
in which this vector lives is a Euclidean space which we 
refer to as data space. By considering Eq. ([T]) , we see that 
the residual vector is the difference between a vector rep- 
resenting the data and vector representing the model (in 
units of the standard deviation). If the model depends 
on N parameters, with N < M, then by varying those 
N parameters, the model vector will sweep out an N- 
dimensional surface embedded within the Af-dimensional 
Euclidean space. We call this surface the model man- 
ifold, it is sometimes also known as the expectation or 
regression surface [HI, [23|. The model manifold of our 
toy model is shown in Fig. [TJ3. The problem of minimiz- 
ing the cost is thus translated into the geometric problem 
of finding the point on the model manifold that is closest 
to the the data. 

In transitioning from the parameter space picture to 
the model manifold picture, we are now faced with the 
problem of minimizing a function on a curved surface. 
Optimization on manifolds is a problem that has been 
given much attention in recent decades ■{•'> . The gen- 
eral problem of minimizing a function on a manifold is 
much more complicated than our problem; however, be- 
cause the cost function is linked here to the structure of 
the manifold the problem at hand is much simpler. 

The metric tensor measures distance on the manifold 
corresponding to infinitesimal changes in the parameters. 
It is induced from the Euclidean metric of the data space 
and is found by considering how small changes in param- 
eters correspond to changes in the residuals. The two are 
related through the Jacobian matrix, 

dr m = d^dB^ = Jm^de^, 
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where repeated indices imply summation. (We also em- 
ploy the convention that Greek letters index parameters, 
while Latin letters index data points, model points, and 
residuals.) The square of the distance moved in data 
space is then 

dr 2 = (J T J)^d6>*d6 u . (7) 

Eq. ((7| is known as the first fundamental form, and the 
coefficient of the parameter infinitesimals is the metric 
tensor, 

Ql±v — {J J} '/^f — ^ ^ Qji^raQy'^ va- 
ra 

The metric tensor corresponds to the approximate Hes- 
sian matrix in Eq. ([S]); therefore, the metric is the Hessian 
of the cost at a point assuming that the point exactly re- 
produced the data. 

Qualitatively, the difference between the metric tensor 
and the Jacobian matrix is that the former describes the 
local intrinsic properties of the manifold while the lat- 
ter describes the local embedding. For nonlinear least 
squares fits, the embedding is crucial, since it is the em- 
bedding that defines the cost function. To understand 
how the manifold is locally embedded, consider a singu- 
lar value decomposition of the Jacobian 

J = UT,V T , 

where V is an N x N unitary matrix satisfying V T V = 1 
and £ is an N x N diagonal matrix of singular values. 
The matrix U is almost unitary, in the sense that it is 
an M x N matrix satisfying U T U = 1; however, UU T 
is not the identity [34]. In other words, the columns of 
U contain N residual space vectors that are orthonormal 
spanning the range of J and not the whole embedding 
space. In terms of the singular value decomposition, the 
metric tensor is then given by 

g = VY?V T , 

showing us that V is the matrix whose columns are the 
local eigenparameters of the metric with eigenvalues A; = 

si 

The singular value decomposition tells us that the Ja- 
cobian maps metric eigenvectors onto the data space vec- 
tor Ui and stretched by an amount \f\i. We hence de- 
note the columns of U the eigenpredictions. The product 
of singular values describes the mapping of local volume 
elements of parameter space to data space. A unit hyper- 
cube of parameter space is stretched along the eigen- 
predictions by the appropriate singular values to form 
a skewed, hyper-parallelepiped of volume \/\g\- 

The Jacobian and metric contain the first derivative 
information relating changes in parameters to changes in 
residuals or model behavior. The second derivative in- 
formation is contained in the connection coefficient. The 
connection itself is a technical quantity describing how 
basis vectors on the tangent space move from point to 



point. The connection is also closely related to geodesic 
motion, introduced properly in section PVTl Qualitatively 
it describes how the metric changes from point to point 
on the manifold. The relevant connection is the Riemann, 
or metric, connection; it is calculated from the metric by 

or in terms of the residuals 

r ^=S^E^ r "A^ m , (8) 

m 

where g^ v = (g^ 1 )^ ■ One could now also calculate the 
Riemann curvature by application of the standard for- 
mulae; however, we postpone a discussion of curvature 
until section IVHl For a more thorough discussion of con- 
cepts from differential geometry, we refer the reader to 
any text on the subject [35l - l38| . 

We have calculated the metric tensor and the connec- 
tion coefficients from the premise that the cost function, 
by its special functional form, has a natural interpreta- 
tion as a Euclidean distance which induces a metric on 
the model manifold. Our approach is in the spirit of 
Bates and Watts' treatment of the subject (TB - Hsj . How- 
ever, the intrinsic properties of the model manifold can 
be calculated in an alternative way without reference to 
the embedding through the methods of Jeffreys, Rao and 
others (9j-[l3|. This approach is known as information 
geometry. We derive these quantities using information 
geometry in Appendix A. 

Given a vector in data space we are often interested in 
decomposing it into two components; one lying within the 
tangent space of the model manifold at a point and one 
perpendicular to the tangent space. For this purpose, we 
introduce the projection operators P T and P N which act 
on data-space vectors and project into the tangent space 
and its compliment respectively. From the Jacobian at a 
point on the manifold, these operators are 

P T = 5-P N = J{g- l )J T , (9) 

where S is the identity operator. It is numerically more 
accurate to compute these operators using the singular 
value decomposition of the Jacobian: 

P T = UU T . 

Turning to the problem of optimization, the parameter 
space picture leads one initially to follow the naive, gradi- 
ent descent direction, — V M C. An algorithm that moves 
in the gradient descent direction will decrease the cost 
most quickly for a given change in the parameters. If the 
cost contours form long narrow canyons, however, this 
direction is very inefficient; algorithms tend to zig-zag 
along the bottom of the canyon and only slowly approach 
the best fit pMj . 

In contrast, the model manifold defines an alterna- 
tive direction which we call the Gauss-Newton direction, 



which decreases the cost most efficiently for a change in 
the behavior. If one imagines sitting on the surface of the 
manifold, looking at the point representing the data, then 
the Gauss- Newton direction in data space is the point di- 
rected toward the data but projected onto the manifold. 
Thus, if v is the Gauss-Newton direction in data space, 
it is given by 

v = -P T r 
= -J(g-')J T r 
= -J{g- l )VC 

= -J^VuC, (10) 

where we have used the fact that VC = J T r. The com- 
ponents of the vector in parameter space, are related 
to the vector in data space through the Jacobian 

v = J^; (11) 

therefore, the direction in parameter space i> M that de- 
creases the cost most efficiently per unit change in be- 
havior is 

v" = -g» u V u C. (12) 

The term 'Gauss-Newton' direction comes from the 
fact that it is the direction given by the Gauss-Newton 
algorithm described in section IVIII Al Because the 
Gauss-Newton direction multiplies the gradient by the 
inverse metric, it magnifies motion along the sloppy di- 
rections. This is the direction that will move the pa- 
rameters along the canyon toward the best fit. The 
Gauss-Newton direction is purely geometric and will be 
the same in data space regardless of how the model is 
parametrized. The existence of the canyons are a conse- 
quence of bad parameterization on the manifold, which 
this parameter independent approach can help to rem- 
edy. Most sophisticated algorithms, such as conjugate 
gradient and Levenberg-Marquardt attempt to follow the 
Gauss-Newton direction as much as possible in order to 
not get stuck in the canyons. 

The obvious connection between sloppiness and the 
model manifold is through the metric tensor. For sloppy 
models, the metric tensor of the model manifold (the ap- 
proximate Hessian of Eq. (JSJ) has eigenvalues spread over 
many decades. This property is not intrinsic to the man- 
ifold however. In fact, one can always reparametrize the 
manifold to make the metric at a point any symmetric, 
positive definite matrix. This might naively suggest that 
sloppiness has no intrinsic geometric meaning, and that 
it is simply a result of a poor choice of parameters. The 
coordinate grid on the model manifold in data space is 
extremely skewed as in Figure [3J By reparametrizing, 
one can remove the skewedness and construct a more 
natural coordinate mesh. We will revisit this idea in 
section IVII We will argue in this manuscript that on 
the contrary, there is a geometrical component to sloppy 
nonlinear models that is independent of parameterization 
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FIG. 3: Skewed Coordinates. A sloppy model is charac- 
terized by a skewed coordinate mesh on the manifold. The 
volume of the parallel-piped is given by the determinant of 
the metric, which is equal to the product of the eigenvalues. 
Because sloppy models have many tiny eigenvalues, these vol- 
umes can be very small with extremely skewed coordinates. 
Our toy model has extremely skewed coordinates where the 
parameters are nearly equal (near the fold line). Most of the 
manifold is covered by regions where the coordinates are less 
skewed which corresponds to a very small region in parameter 
space. 



and in most cases that the human-picked 'bare' parame- 
ters naturally illuminate the sloppy intrinsic structure of 
the model manifold. 

In the original parameterization, sections of parame- 
ter space are mapped onto very tiny volumes of data 
space. We remind the reader that a unit volume of pa- 
rameter space is mapped into a volume of data space 
given by \f\g\. Because many eigenvalues are nearly zero 
for sloppy models, the model manifold necessarily occu- 
pies a tiny sliver of data space. In fact, if a region of 
parameter space has larger eigenvalues by even a small 
factor, the cumulative effect on the product is that this 
region of parameter space will occupy most of the model 
manifold. We typically find that most of the model man- 
ifold is covered by a very small region of parameter space 
which corresponds to the volumes of (slightly) less skewed 
meshes. 

We will see when we discuss curvature, that the large 
range of eigenvalues in the metric tensor usually corre- 
spond to a large anisotropy in the extrinsic curvature. 
Another geometric property of sloppy systems relates 
to the boundaries that the model imposes on the mani- 
fold. The existence of the boundaries for the toy model 
can be seen clearly in Fig. [TJ;. The surface drawn in 
the figure corresponds the patch of parameters within 
< 0i, #2 < oo. The three boundaries of the surface oc- 
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cur when the parameters reach their respective bounds. 
The one exception to this is the fold line, which corre- 
sponds to when the parameters are equal to one another. 
This anomalous boundary (6\ = 62) is called the fold 
line and is discussed further in section HVl Most nonlin- 
ear sloppy models have boundaries. 

In the next section we will discuss how boundaries 
arise on the model manifold and why they pose problems 
for optimization algorithms. Then, in section HVl we de- 
scribe another surface, the model graph, that removes the 
boundaries. The surface described by the model graph 
is equivalent to a model manifold with a linear Bayesian 
prior added as additional residuals. In section [V] we show 
that introducing other priors can be even more helpful for 
keeping algorithms away from the boundaries. 

III. BOUNDED MANIFOLDS 

Sloppiness is closely related to the existence of bound- 
aries on the model manifold. This may seem to be a 
puzzling claim because sloppiness has previously been 
understood to be a statement relating to the local lin- 
earization of model space. Here we will extend this idea 
and see that it relates to the global structure of the man- 
ifold and how it produces difficulties for the optimization 
process. 

To understand the origin of the boundaries on model 
manifolds, consider first the model of summing several 
exponentials 

tf(t,0) =$>-'*'■ 
n 

We restrict ourselves to considering only positive argu- 
ments in the exponentials, which limits the range of be- 
havior for each term to be between and 1. This restric- 
tion already imposes boundaries on the model manifold, 
but those boundaries become much more narrow as we 
consider the range the model can produce by holding just 
a few time points fixed. 

Fixing the output of the model at a few time points 
greatly reduces the values that the model can take on for 
all the remaining points. Fixing the values that the model 
takes on at a few data points is equivalent to considering 
a lower-dimensional cross section of the model manifold, 
as we have done in Fig. The boundaries on this cross 
section are very narrow; the corresponding manifold is 
long and thin. Clearly, an algorithm that navigates the 
model manifold will quickly run into the boundaries of 
this model unless it is actively avoiding them. 

In general, if a function is analytic, the results pre- 
sented in Fig. |4] are fairly generic, they come from gen- 
eral theorems governing the interpolation of functions. 
If a function is sampled at a sufficient number of time 
points to capture its major features, then the behavior of 
the function at times between the sampling can be pre- 
dicted with good accuracy by an interpolating function. 
For polynomial fits, as considered here, a function, f(t), 
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FIG. 4: Fixing a few data points greatly restricts the possi- 
ble range of the model behavior between those data points 
(lower). This is a consequence of interpolation of analytic 
functions. In this case, fit) is a sum of three exponentials 
with six parameters (amplitudes and rates). Shown above is 
a three dimensional slice of possible models plotted in data 
space, with the value of /(0) fixed to 1 and the value of /(l) 
fixed to 1/e. With these constraints we are left with a four di- 
mensional surface, meaning that the manifold of possible data 
shown here is indeed a volume. However, from a carefully cho- 
sen perspective (upper right) this volume can be seen to be 
extremely thin-in fact most of its apparent width is curvature 
of the nearly two dimensional sheet, evidenced by being able 
to see both the top (green) and bottom (black) simultane- 
ously. Generic aspects of this picture illustrate the difficulty 
of fitting nonlinear problems. Geodesies in this volume are 
just straight lines in three dimensions. Although the man- 
ifold seems to be only slightly curved, its extreme thinness 
means that geodesies travel very short distances before run- 
ning into model boundaries, necessitating the diagonal cutoff 
in Levenberg-Marquardt algorithms as well as the priors dis- 
cussed in section IV1 

sampled at n time points, (ti, t-2, t n ), can be fit ex- 
actly by a unique polynomial of degree n — 1, P n _i(t). 
Then at some interpolating point, i, the discrepancy in 
the interpolation and the function is given by 

m - Pn _ l{t) = ?wpl, (13) 

where f^ n \t) is the n-th derivative of the function and 
£ lies somewhere in the range t\ < £ < t n [39]. The 
polynomial oj(t) has roots at each of the interpolating 
points 

w(t) = (t-t 1 )(t-t 2 )...(t-t n ). 

By inspecting Eq. (flU]) . it is clear that the discrepancy 
between the interpolation and the actual function will be- 
come vanishingly small if higher derivatives of the func- 
tion do not grow too fast (which is the case for analytic 
functions) and if the sampling points are not too widely 
spaced (see Fig. [5]). 

The possible error of the interpolation function bounds 
the allowed range of behavior, Sf n , of the model at t 
after constraining the nearby n data points, which corre- 
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FIG. 5: The possible values of a model at intermediate time 
points are restricted by interpolating theorems. Taking cross 
sections of the model manifold corresponds to fixing the model 
values at a few time points, restricting the possible values at 
the remaining times. Therefore, the model manifold will have 
a hierarchy of progressively thinner widths, much like a hyper- 
ribbon. 



sponds to measuring cross sections of the manifold. Con- 
sider the ratio of successive cross sections, 

_ = ( t - tn+1 )(»+l)- R ^-, 



if 11 is sufficiently large, then 



therefore, we find that 



n+1 



t-U 



Sfn 



R 



R 1 



< 1 



by the ratio test. Each cross section is thinner than the 
last by a roughly constant factor A = St/R, predicting a 
hierarchy of widths on the model manifold. We describe 
the shape of a model manifold with such a hierarchy as 
a hyper-ribbon. We will now measure these widths for a 
few sloppy models and see that the predicted hierarchy 
is in fact present. 

As a first example, consider the sloppy model of fitting 
polynomials 



t" 



(14) 



If the parameters of the model are allowed to vary over 
all real values, then one can always fit M data points 
exactly with an (M — l) th degree polynomial. However, 
we wish to artificially restrict the range of the parameters 
to imitate the limited range of behavior characteristic 
of nonlinear models. A simple restriction is given by 
T^m @m — 1 ■ This constraint enforces the condition that 



higher derivatives of the function become small (roughly 
that the radius of convergence is one) and corresponds to 
the unit hyper-sphere in parameter space. If this function 
is sampled at time points (ti, t n ) then the model 
vector in data space can be written as 
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The matrix multiplying the vector of parameters is an 
example of a Vandermonde matrix. The Vandermondc 
matrix is known to be sloppy and, in fact, plays an im- 
portant role in the sloppy model universality class. The 
singular values of the Vandermonde matrix are what pro- 
duce the sloppy eigenvalue spectrum of sloppy models. 
Reference [8] shows that these singular values are indeed 
broadly spaced in log. For this model, the Vandermonde 
matrix is exactly the Jacobian. 

By limiting our parameter space to a hypersphere for 
the model in Eq. (fT4")h the corresponding model manifold 
is limited to a hyper-ellipse in data space. The principal 
axes of this hyper-ellipse are the eigenpredictions direc- 
tions we discussed in section [II] The lengths of the prin- 
cipal axes are the singular values. Consequently, there 
will be a hierarchy of progressively thinner boundaries 
on the model manifold due to the wide ranging singu- 
lar values of the Vandermonde matrix. For this model, 
the purely local property of the metric tensor eigenvalue 
spectrum is intimately connected to the global property 
of the boundaries and shape of the model manifold. 

As a second example, consider the model consisting of 
the sum of eight exponential terms, y — ^2 A^e -6 ^. 
We use log-parameters, rg M = log# M and ta^ — logA^, 
to make parameters dimensionless and enforce positiv- 
ity. We numerically calculate the several widths of the 
corresponding model manifold in Fig. where we see 
that they are accurately predicted by the singular values 
of the Jacobian. The widths in Fig. |6] were calculated 
by considering geodesic motion in each of the eigendi- 
rections of the metric from some point located near the 
center of the model manifold. We follow the geodesic mo- 
tion until it reaches a boundary; the length in data space 
of the geodesic is the width. Alternatively, we can choose 
M — N orthogonal unit vectors that span the space per- 
pendicular to the tangent plane at a point and a single 
unit vector given by a eigenprediction of the Jacobian 
which lies within the tangent plane. The M — N + 1 
dimensional hyper-plane spanned by these unit vectors 
intersects the model manifold along a one-dimensional 
curve. The width can be taken to be the length of that 
intersection. The widths given by these two methods are 
comparable. 

We can show analytically that our exponential fitting 
problem has model manifold widths proportional to the 
corresponding singular values of the Jacobian in the limit 
of a continuous distribution of exponents, 8^. using an 
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argument provided to us by Yoav Kallus. In this limit, 
the sum can be replaced by an integral, 

y(t) = J d6A{9)e- te =£{A(9)}, 

where the model is now the Laplace transform of the am- 
plitudes A{6). In this limit the data can be fit without 
varying the exponential rates, leaving only the linear am- 
plitudes as parameters. If we assume the data has been 
normalized according to y(t = 0) < 1, then it is natu- 
ral to consider the hyper-tetrahedron of parameter space 
given by by A n > and ^2 A n < 1. In parameter space, 
this tetrahedron has a maximum aspect ratio of a/2/M, 
but the mapping to data space distorts the tetrahedron 
by a constant Jacobian whose singular values we have 
seen to span many orders of magnitude. The resulting 
manifold thus must have a hierarchy of widths along the 
eigenpredictions equal to the corresponding eigenvalues 
within the relatively small factor y/2/M. 

As our third example, we consider a feed-forward ar- 
tificial neural network [40(. For computational ease, we 
choose a small network consisting of a layer of four in- 
put neurons, a layer of four hidden neurons, and an out- 
put layer of two neurons. We use the hyperbolic tangent 
function as our sigmoid function and vary the connection 
weights as parameters. As this model is not known to re- 
duce to a linear model in any limit, it serves as a test 
that the agreement for fitting exponentials is not special. 
Fig. (BJa shows indeed that the singular values of the Ja- 
cobian agree with geodesic widths again for this model. 

The results in Fig. [5] is one of our main results and 
requires some discussion. Strictly speaking, the singular 
values of the Jacobian have units of data space distance 
per unit parameter space distance, while the units of the 
widths are data space distance independent of param- 
eters. In the case of the exponential model, we have 
used log-parameters, making the parameters dimension- 
less. In the neural network, the parameters are the con- 
nection weights whose natural scale is one. In general, 
the exact agreement between the singular values and the 
widths may not agree if the parameters utilize different 
units or have another natural scale. One must note, how- 
ever, that the enormous range of singular values implies 
that the units would have to be radically different from 
natural values to lead to significant distortions. 

Additionally, the two models presented in Fig. [5] are 
particularly easy to fit to data. The fact that from a 
centrally located point, geodesies can explore nearly the 
entire range of model behavior suggests that the bound- 
aries are not a serious impediment to the optimization. 
For more difficult models, such as the PC12 model in sys- 
tems biology 0, we find that the the widths estimated 
from the singular values and from geodesic motion dis- 
agree. The geodesic widths are much smaller than the 
singular value estimates. In this case, although the spac- 
ing between geodesic widths is the same as the spacing 
between the singular values, they are smaller by several 
orders of magnitude. We believe that most typical start- 
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FIG. 6: a) Geodesic cross-sectional widths of an eight dimen- 
sional model manifold along the eigendirections of the metric 
from some central point, together with the square root of the 
eigenvalues (singular values of the Jacobian) [23 |. Notice the 
hierarchy of these data-space distances - the widths and sin- 
gular values each spanning around four orders of magnitude. 
To a good approximation, the cross-sectional widths are given 
by singular values. In the limit of infinitely many exponential 
terms, this model becomes linear, b) Geodesic cross-sectional 
widths of a feed-forward artificial neural network. Once again, 
the widths nicely track the singular values. 

ing points of this model lie near a hyper-corner of the 
model manifold. If this is the case, then geodesies will be 
unable to explore the full range of model behavior with- 
out reaching a model boundary. We argue later in this 
section that this phenomenon is one of the main difficul- 
ties in optimization, and in fact, we find that the PC12 
model is a much more difficult fitting problem than either 
the exponential or neural network problem. 

We have seen that sloppiness is the result of skewed 
coordinates on the model manifold, and we will argue 
later in section [VTI that algorithms are sluggish as a re- 
sult of this poor parameterization. Fig.[6]tells us that the 
'bare' model parameters are not as perverse as one might 
naively have thought. Although the bare-parameter di- 
rections are inconvenient for describing the model behav- 
ior, the local singular values and eigenpredictions of the 
Jacobian are useful estimates of the model's global shape. 
The fact that the local stiff and sloppy directions coincide 
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with the global long and narrow directions is a nontrivial 
result that seems to hold for most models. 

To complete our description of a typical sloppy model 
manifold requires a discussion of curvature, which we 
postpone until section lVII Dl We will see that in addition 
to a hierarchy of boundaries, the manifold typically has 
a hierarchy of extrinsic and parameter-effects curvatures 
whose scales are set by the smallest and widest widths 
respectively. 

We argue elsewhere f22|, that the ubiquity of sloppy 
models, appearing everywhere from models in systems 
biology [6], insect flight [8], variational quantum wave 
functions, inter-atomic potentials (4l| . and a model of 
the next-generation international linear collider [7|, im- 
plies that a large class of models have very narrow bound- 
aries on their model manifolds. The interpretation that 
multiparameter fits are a type of high-dimensional an- 
alytic interpolation scheme, however, also explains why 
so many models are sloppy. Whenever there are more 
parameters than effective degrees of freedom among the 
data points, then there are necessarily directions in pa- 
rameter space that have a limited effect on the model 
behavior, implying the metric must have small eigenval- 
ues. Because successive parameter directions have a hi- 
erarchy of vanishing effect on model behavior, the metric 
must have a hierarchy of eigenvalues. 

We view most multiparameter fits as a type of multi- 
dimensional interpolation. Only a few stiff parameter 
combinations need to be tuned in order to find a rea- 
sonable fit. The remaining sloppy degrees of freedom do 
not alter the fit much, because they fine tune the inter- 
polated model behavior, which, as we have seen, is very 
restricted. This has important consequences for inter- 
preting the best fit parameters. One should not expect 
the best fit parameters to necessarily represent the phys- 
ical values of the parameters, as each parameter can be 
varied by many orders of magnitude along the sloppy di- 
rections. Although the parameter values at a best fit 
cannot typically be trusted, one can still make falsifiablc 
predictions about model behavior without knowing the 
parameter values by considering an ensemble of parame- 
ters with reasonable fits pH!, [||. 

For our fitting exponential example, part of the model 
boundary was the 'fold lines' where pairs of the exponents 
are equal (see Fig. [!}. No parameters were at extreme 
values, but the model behavior was nonetheless singu- 
lar. Will such internal boundaries arise generically for 
large nonlinear models? Model boundaries correspond 
to points on the manifold where the metric is singular. 
Typical boundaries occur when parameters are near their 
extreme values (such as ±oo or zero), where the model 
becomes unresponsive to changes in the parameters. For- 
mally, a singularity will occur if the basis vectors on the 
model manifold given by = d^f are linearly depen- 
dent, which is to say there exist a set of nonzero o^'s for 
which 

a% = 0. (16) 



In order to satisfy Eq. (fH)| we may vary 2N parame- 
ters (the N values of a M plus the N parameters of the 
model) to satisfy M equations. Therefore if M < 2N 
there will exist nontrivial singular points of the metric at 
non-extreme values of the parameters. 

For models with M > 2N, we do not expect Eq. (|T5j) to 
be exactly satisfied generically except at extreme values 
of the parameters when one or more of the basis vec- 
tors vanish, e M = 0. However, many of the data points 
are interpolating points as we have argued above, and 
we expect qualitatively to be able to ignore several data 
points without much information loss. In general, we ex- 
pect that Eq. (|16p could be satisfied to machine precision 
at nontrivial values of the parameters even for relatively 
small N. 

Now that we understand the origin of boundaries on 
the model manifold, we can discuss why they are prob- 
lematic for the process of optimization. It has been ob- 
served in the context of training neural networks, that 
metric singularities (i.e. model boundaries) can have a 
strong influence on the fitting [42j. More generally, the 
process of fitting a sloppy model to data involves the frus- 
trating experience of applying a black box algorithm to 
the problem which appears to be converging, but then 
returns a set of parameters that does not fit the data 
well and includes parameter values that are far from any 
reasonable value. We refer to this drift of the parameters 
to extreme values as parameter evaporation [7(| • This 
phenomenon is troublesome not just because it causes 
the algorithm to fail. Often, models are more compu- 
tationally expensive to evaluate when they are near the 
extreme values of their parameters. Algorithms will of- 
ten not just fail to converge, but they will take a long 
time in the process. 

After an algorithm has failed and parameters have 
evaporated, one may resort to adjusting the parameter 
values by hand and then reapplying the algorithm. Hope- 
fully, iterating this process will lead to a good fit. Even 
if one eventually succeeds in finding a good fit, because 
of the necessity of adjusting parameters by hand, it can 
be a long and boring process. 

Parameter evaporation is a direct consequence of the 
boundaries of the model manifold. To understand this, 
recall from section [TT] that the model manifold defines a 
natural direction, the Gauss-Newton direction, that most 
algorithms try to follow. The problem with blindly fol- 
lowing the Gauss-Newton direction is that it is purely 
local and ignores the fact that sloppy models have bound- 
aries. Consider our example model; the model manifold 
has boundaries when the rates become infinite. If an ini- 
tial guess has over-estimated or under-estimated the pa- 
rameters, the Gauss-Newton direction can point toward 
the boundary of the manifold, as does fit A in Fig. [7] 
If one considers the parameter space picture, the Gauss- 
Newton direction is clearly nonsensical, pointing away 
from the best fit. Generally, while on a plateau region, 
the gradient direction is better at avoiding the manifold 
boundaries. However, nearer the best fit, the boundary is 
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FIG. 7: a) Falling off the edge of the model mani- 
fold. The manifold in data space defines a "natural" direction, 
known as the Gauss-Newton direction, in which an algorithm 
will try to follow to the best fit. Often this direction will 
push parameters toward the edge of the manifold, b) Gradi- 
ent and Gauss-Newton directions in Parameter space. 
The manifold edge corresponds to infinite values of the pa- 
rameters. Following the Gauss-Newton direction to the edge 
of the manifold will cause parameters to evaporate while on 
the plateau. While in a canyon, however, the Gauss-Newton 
direction gives the most efficient direction to the best fit. 



less important and the Gauss-Newton direction is much 
more efficient than the downhill direction, as is the case 
for fit B in Fig. 

Since the model manifold typically has several narrow 
widths, it is reasonable to expect that a fit to noisy data 
will evaporate many parameters to their limiting values 
(such as oo or zero), as we explore in section fVII Gl We 
therefore do not want to prevent the algorithm from evap- 
orating parameters altogether. Instead, we want to pre- 
vent the algorithm from prematurely evaporating param- 
eters and becoming stuck on the boundary (or lost on the 
plateau). Using the two natural directions to avoid the 
manifold boundaries while navigating canyons to the best 
fit is at the heart of the difficulty in optimizing sloppy 
models. Fortunately, there exists a natural interpolation 
between the two pictures which we call the model graph 
and is the subject of the next section. This natural in- 
terpolation is exploited by the Levenberg-Marquardt al- 
gorithm, which we discuss in section I Villi 



IV. THE MODEL GRAPH 

We saw in Section IIIII that the geometry of sloppiness 
explains the phenomenon of parameter evaporation as 
algorithms push parameters toward the boundary of the 
manifold. However, as we mentioned in Section [Til the 
model manifold picture is a view complementary to the 
parameter space picture, as illustrated in Fig. [TJ 

The parameter space picture has the advantage that 
boundaries typically do not exist (i.e. they lie at param- 
eter values equal to oo). If model boundaries occur for 
parameter values that are not infinite, but are otherwise 
unphysical, for example, 8 = for our toy model, it is 
helpful to change parameters in such a way as to map 
these boundaries to infinity. For the case of summing 
exponentials, it is typical to work in log#, which puts 
all boundaries at infinite parameter values and has the 
added bonus of being dimensionless (avoiding problems of 
choice of units). In addition to removing boundaries, the 
parameter space does not have the complications from 
curvature; it is a flat, Euclidean space. 

The disadvantage of the parameter space picture is 
that motion in parameter space is extremely disconnected 
from the behavior of the model. This problem arises as 
an algorithm searches the plateau looking for the canyon 
and again when it follows the winding canyon toward the 
best fit. 

The model manifold picture and the parameter space 
picture can be combined to utilize the strengths of both 
approaches. This combination is called the model graph 
because it is the surface created by the graph of the 
model, i.e. the behavior plotted against the parameters. 
The model graph is an N dimensional surface embedded 
in an M + N dimensional Euclidean space. The embed- 
ding space is formed by combining the M dimensions of 
data space with the N dimensions of parameter space. 
The metric for the model graph can be seen to be 

<W = 9% + MV, (17) 

where q?„. = (J T T) is the metric of the model manifold 
and -D M „ is the metric of parameters space. We discuss 
common parameter space metrics below. We have intro- 
duced the free parameter A in Eq. (fTT|) which gives the 
relative weight of the parameter space metric to the data 
space metric. Most of the work in optimizing an algo- 
rithm comes from a suitable choice of A, known as the 
damping parameter or the Levenberg-Marquardt param- 
eter. 

If Z? M „ is the identity, then we call the metric in 
Eq. (|17|) the Levenberg metric because of its role in the 
Levenberg algorithm j43|. Another possible choice for 
D^ u is to populate its diagonal with the diagonal el- 
ements of g^ v while leaving the off-diagonal elements 
zero. This choice appears in the Levenberg-Marquardt 
algorithm [44[ and has the advantage that the resulting 
method is invariant to rescaling the parameters, e.g. it is 
independent of units. It has the problem, however, that if 
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a parameter evaporates then its corresponding diagonal 
element may vanish and the model graph metric becomes 
singular. To avoid this dilemma, one often chooses D to 
have diagonal elements given by the largest diagonal el- 
ement of g° yet encountered by the algorithm [45] . This 
method is scale invariant but guarantees that D is always 
positive definite. We discuss these algorithms further in 
section IVIII1 

It is our experience that the Marquardt metric is 
much less useful than the Levenberg metric for prevent- 
ing parameter evaporation. While it may seem counter- 
intuitive to have a metric (and by extension an algorithm) 
that is sensitive to whether the parameters are measured 
in inches or miles, we stress that the purpose of the model 
graph is to introduce parameter dependence to the man- 
ifold. Presumably, the modeler is measuring parameters 
in inches because inches are a more natural unit for the 
model. By disregarding that information, the Marquardt 
metric is losing a valuable sense of scale for the parame- 
ters and is more sensitive to parameter evaporation. The 
concept of the natural units will be important in the dis- 
cussion of priors in section [V] On the other hand, the 
Marquardt method is faster at following a narrow canyon 
and the best choice likely depends on the particular prob- 
lem. 

If the choice of metric for the parameter space is con- 
stant, d a D^ = 0, then the connection coefficients of the 
model graph (with all lowered indices) are the same as 
for the model manifold given in Eq. (jSJ. The connec- 
tion with a raised index will include dependence on the 
parameter space metric: 

Kp = (9- 1 Y v Y, d » r ™ d <* d e r ™ 

m 

where g is given by Eq. (jTTJ) . 

By considering the model graph instead of the model 
manifold, we can remove the problems associated with 
the model boundaries. We return to our example prob- 
lem to illustrate this point. The embedding space for 
the model graph is 3 + 2 = 5 dimensional, so we are re- 
stricted to viewing 3 dimensional projections of the em- 
bedding space. In Fig. [5] we illustrate the model graph 
(Levenberg metric) for A = 0, which is simply the model 
manifold, and for A 7^ 0, which shows that boundaries of 
the model manifold are removed in the graph. Since the 
boundaries occur at 9 = 00, they are infinity far from 
the origin on the model graph. Even the boundary cor- 
responding to the fold line has been removed, as the fold 
has opened up like a folded sheet of paper. Since generic 
boundaries correspond to singular points of the metric, 
the model graph has no such boundaries as its metric is 
positive definite for any A > 0. 

After removing the boundaries associated with the 
model manifold, the next advantage of the model graph 
is to provide a means of seamlessly interpolating between 
the natural directions of both data space and parameter 
space. The damping term, A, appearing in Eq. (I17p is well 
suited for this interpolation in sloppy models. If we con- 




FIG. 8: The effect of the damping parameter is to produce a 
new metric for the surface induced by the graph of the model 
versus the input parameters, (a) Model Graph, A = 0. If 
the parameter is zero, then the resulting graph is simply the 
original model manifold, with no extent in the parameter di- 
rections. Here we see a flat two dimensional cross section; 
the z-axis is a parameter value multiplied by ■s/X — 0. (b) 
Model Graph A / 0. If the parameter is increased, the sur- 
face is "stretched" into a higher dimensional embedding space. 
This is an effective technique for removing the boundaries, as 
no such boundary exists in the model graph. However, this 
comes at a cost of removing the geometric connection be- 
tween the cost function and the structure of the surface. For 
very large damping parameters, the model graph metric be- 
comes a multiple of the parameter space metric, which rotates 
the Gauss-Newton direction into the gradient direction. The 
damping term therefore interpolates between the parameter 
space metric and the data space metric. 

sider the Levenberg metric, the eigenvectors of the model 
manifold metric, g°, are unchanged by adding a multiple 
of the identity. However, the corresponding eigenvalues 
are shifted by the A parameter. It is the sloppy eigenval- 
ues that are dangerous to the Gauss-Newton direction. 
Since the eigenvalues of a sloppy model span many or- 
ders of magnitude, this means that all the eigenvalues 
that were originally less than A are cutoff at A in the 
model graph metric, and the larger eigenvalues are vir- 
tually unaffected. By adjusting the damping term, we 
can essentially wash out the effects of the sloppy direc- 
tions and preserve the Gauss-Newton direction from the 
model manifold in the stiff directions. Since the eigen- 
values span many orders of magnitude, the parameter 
does not need to be finely tuned; it can be adjusted very 
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roughly and an algorithm will still converge, as we will 
see in section [Villi We demonstrate how A can interpo- 
late between the two natural directions for our example 
model in Fig. [51 



V. PRIORS 

In Bayesian statistics, a prior is an a-priori prob- 
ability distribution in parameter space, giving infor- 
mation about the relative probability densities for 
the model as parameters are varied. For exam- 
ple, if one has pre-existing measurements of the pa- 
rameters 6 m = 8^ ± a m with normally distributed 
uncertainties, then the probability density would be 
Ilm l/v/^exp [-(0 m - C) 2 /(2^ 2 „] before fitting to 
the current data. This corresponds to a negative-log- 
likclihood cost that (apart from an overall constant) is 
the sum of squares, which can be nicely interpreted as 
the effects of an additional set of "prior residuals" 

r m = (dm - 6° m )/a m (18) 

(interpreting the pre-existing measurements as extra data 
points). In this section, we will explore the more general 
use of such extra terms, not to incorporate information 
about parameter values, but rather to incorporate infor- 
mation about the ranges of parameters expected to be 
useful in generating good fits. 

That is, we want to use priors to prevent parame- 
ter combinations which are not constrained by the data 
from taking excessively large values - we want to avoid 
parameter evaporation. To illustrate again why this is 
problematic in sloppy models, consider a linear sloppy 
model with true parameters 6$, but fit to data with 
added noise The observed best fit is then shifted to 
9 = 6q + (J T J) _1 (J )£. The measurement error in data 
space £i is thus multiplied by the inverse of the poorly 
conditioned matrix g — J T J, so even a small measure- 
ment error produces a large parameter-space error. In 
section [VII Gl we will see in nonlinear models that such 
noise will generally shift the best fits to the boundary (in- 
finite parameter values) along directions where the noise 
is large compared to the width of the model manifold. 
Thus for example in fitting exponentials, positive noise 
in the data point at to = and negative noise at the 
data point at the first time t\ > can lead to one decay 
rate that evaporates to infinity, tuned to fit the first data 
point without affecting the others. 

In practice, it is not often useful to know that the opti- 
mum value of a parameter is actually infinite - especially 
if that divergence is clearly due to noise. Also, we have 
seen in Fig. [7K that, even if the best fit has sensible pa- 
rameters, algorithms searching for the best fits can be 
led toward the model manifold boundary. If the param- 
eters are diverging at finite cost, the model must nec- 
essarily become insensitive to the diverging parameters, 
often leading the algorithm to get stuck. Even a very 



weak prior whose residuals diverge at the model mani- 
fold boundaries can prevent these problems, holding the 
parameters in ranges useful for fitting the data. 

In this section, we advocate the use of priors for help- 
ing algorithms navigate the model manifold in finding 
good fits. These priors are pragmatic; they are not in- 
troduced to buffer a model with 'prior knowledge' about 
the system, but to use the data to guess the parameter 
ranges outside of which the fits will become insensitive to 
further parameter changes. Our priors do not have mean- 
ing in the Bayesian sense, and indeed should probably be 
relaxed to zero at late stages in the fitting process. 

The first issue is how to guess what ranges of parame- 
ter are useful in fits - outside of which the model behavior 
becomes insensitive to the parameter values. Consider, 
for example, the Michaelis-Mentin reaction, a saturable 
reaction rate often arising in systems biology (for exam- 
ple Reference [H): 

4x*] = k x [y*][x] , ig , 
dt 1 + km x [x] 

Here there are two parameters k x and km x , governing 
the rate of production of [x*] from [x] in terms of the 
concentration [y*], where [x] + [x*] = x ma x and [y] + [y*] = 

Vraax • 

Several model boundaries can be identified here. If k x 
and km x x m ax are both very large, then only their ratio 
affects the dynamics. In addition if km x is very small 
then it has no effect on the model. Our prior should en- 
force our belief that ^[1] is typically of order 1. If it 
were much larger than one, than we could have modeled 
the system with one less parameter k = k x /km x and if it 
were much less than one, the second term in the denom- 
inator could have been dropped entirely. Furthermore, 
if the data is best fit by one of these boundary cases, 
say km x x max — > 00 , it will be fit quite well by taking 
km x x max » 1, but otherwise finite. In a typical model 
we might expect that km x x max — 10 will behave as if it 
were infinite. 

We can also place a prior on k x . Dimensional analysis 
here involves the time scale at which the model is predic- 
tive. The prior should match the approximate time scale 
of the model's predictions to the rate of the modeled re- 
action. For example, if an experiment takes time series 
data with precision on the order of seconds with intervals 
on the order minutes, then a 'fast' reaction is any that 
takes place faster than a few seconds and a slow reaction 
is any that happens over a few minutes. Even if the real 
reaction happens in microseconds, it makes no sense to 
extract such information from the model and data. Sim- 
ilarly, a slow reaction that takes place in years could be 
well fit by any rate that is longer than a few minutes. 
As such we want a prior which prevents k x y max x max /T 
from being far from 1, where t is the typical timescale of 
the data, perhaps a minute here. In summary, we want 
priors to constrain both km x x max and k x x max y max / t to 
be of order one. 
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FIG. 9: (A) Gauss-Newton Directions. The Gauss- Newton direction is prone to pointing parameters toward infinity, espe- 
cially in regions where the metric has very small eigenvalues. (B) Rotated Gauss-Newton Directions. By adding a small 
damping parameter to the metric, the Gauss-Newton direction is rotated into the gradient direction. The amount of rotation is 
determined by the eigenvalues of the metric at any given point. Here, only a few points are rotated significantly. (C) Gradient 
Directions. For large values of the damping parameter, the natural direction is rotated everywhere into the gradient direction. 



We have found that a fairly wide range of priors can 
be very effective at minimizing the problems associated 
with parameter evaporation during fitting. To choose 
them, we propose starting by changing to the natural 
units of the problem by dividing by constants, such as 
time scales or maximum protein concentrations, until all 
of the parameters are dimensionless. (Alternatively, pri- 
ors could be put into the model in the original units, at 
the expense of more complicated book-keeping.) In these 
natural units we expect all parameters to be order 1. 

The second issue is to choose a form for the prior. For 
parameters like these, where both large and near-zero 



values are to be avoided, we add two priors for every pa- 
rameter, one which punishes high values, and one which 
punishes small values: 



Pr{9) = 




This prior has minimum contribution to the cost when 
^ 2 = uT so m ^ ne P ro P er units we choose Wh =Wi. With 
these new priors, the metric becomes 

9^ = d^dvr 01 + d»Pr{6)d v Pr{6) (21) 
= & + WS (22) 
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which is positive definite for all (positive) values of 0. 
As boundaries occur when the metric has an eigenvalue 
of zero, no boundaries exist for this new model manifold. 
This is reminiscent of the metric of the model graph with 
the difference being that we have permanently added this 
term to the model. The best fit has been shifted in this 
new metric. 

It remains to choose Wh and wi. Though the choice is 
likely to be somewhat model specific, we have found that 
a choice between .001 and 1 tends to be effective. That 
weights of order 1 can be effective is somewhat surpris- 
ing. It implies that good fits can be found while punish- 
ing parameters for differing only an order of magnitude 
from their values given by dimensional analysis. That 
this works is a demonstration of the extremely ill-posed 
nature of these sloppy models, and the large ensemble of 
potential good fits in parameter space. 

A complimentary picture of the benefit of priors takes 
place in parameter space, where they contribute to the 
cost: 



C = C o +J2wh0i/2 + wil '(20*). 



(23) 



The second derivative of the extra cost contribution 
with respect to the log of the parameters is given by 

ai0 g 2 (9)a (^4^-) = ^ + w- This is Positive definite 
and concave, making the entire cost surface large when 
parameters are large. This in turn makes the cost surface 
easier to navigate by removing the problems associated 
with parameter evaporation on plateaus. 

To demonstrate the effectiveness of this method, we use 
the PC12 model with 48 parameters described in We 
change to dimensionless units as described above. To cre- 
ate an ensemble, we start from 20 initial conditions, with 
each parameter taken from a Gaussian distribution in its 
log centered on (the expected value from dimensional 
analysis), with a a — log 10 (so that the bare parame- 
ters range over roughly two orders of magnitude from .1 
to 10). We put a prior as described above centered on 
the initial condition, with varying weights. These corre- 
spond to the priors that we would have calculated if we 
had found those values by dimensional analysis instead. 
After minimizing with the priors, we remove them and 
allow the algorithm to re-minimize. The results are plot- 
ted in Fig. Q3H 

Strikingly, even when a strong prior is centered at pa- 
rameter values a factor of ~ 100 away from their 'true' 
values, the addition of the prior in the initial stages of 
convergence dramatically increases the speed and success 
rate of finding the best fit. 

In section IIV1 we introduced the model graph and 
the Levenberg-Marquardt algorithm, whose rationale (to 
avoid parameter evaporation) was similar to that moti- 
vating us here to introduce priors. To conclude this sec- 
tion, we point out that the model graph metric, Eq. (|17p. 
and the metric for our particular choice of prior, Eq. (I22p . 
both serve to cut off large steps along sloppy direc- 
tions. Indeed, the Levenberg-Marquardt algorithm takes 
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Jacobian Evaluations 

FIG. 10: The final cost is plotted against number of Jacobian 
evaluations for five strengths of priors. After minimizing with 
priors, the priors are removed and a maximum of 20 further 
Jacobian evaluations are performed. The prior strength is 
measured by p, with p — meaning no prior. The success 
rate is R. The strongest priors converge the fastest, with 
medium strength priors showing the highest success rate. 



a step identical to that for a model with quadratic priors 
(Eq. I|18p) with cr m = 1/vA, except that the center of the 
prior is not a fixed set of parameters 9 , but the current 
parameter set 9* . (That is, the second derivative of the 

sum of the squares of these residuals, X) m [v / A(^ — (?*)] 2 
gives XSfw, the Levenberg term in the metric.) This Lev- 
enberg term thus acts as a 'moving prior' - acting to 
limit individual algorithmic steps from moving too far 
toward the model boundary, but not biasing the algo- 
rithm permanently toward sensible values. Despite the 
use of a variable A that can be used to tune the algorithm 
toward sensible behavior (Fig. [9]), we shall see in sec- 
tion [VlTT] that the Levenberg-Marquardt algorithm often 
fails, usually because of parameter evaporation. When 
the useful ranges of parameters can be estimated before- 
hand, adding priors can be a remarkably effective tool. 



VI. EXTENDED GEODESIC COORDINATES 

We have seen that the two difficulties of optimizing 
sloppy models are that algorithms tend to run into the 
model boundaries and that model parametrization tends 
to form long, curved canyons around the best fit. We 
have discussed how the first problem can be improved by 
the introduction of priors. We now turn our attention 
to the second problem. In this section we consider the 
question of whether we can change the parameters of a 
model in such a way as to remove this difficulty. We 
construct coordinates geometrically by considering the 
motion of geodesies on the manifold. 

Given two nearby points on a manifold, one can con- 
sider the many paths that connect them. If the points 
are very far away, there may be complications due to the 
boundaries of the manifold. For the moment, we assume 
that the points are sufficiently close that boundaries can 



16 



be ignored. The unique path joining the two points whose 
distance is shortest is known as the geodesic. The param- 
eters corresponding to a geodesic path can be found as 
the solution of the differential equation 

sr + r^x a x? = o, (24) 

where are the connection coefficients given by Eq. © 
and the dot means differentiation with respect to the 
curve's affine parametrization. Using two points as 
boundary values, the solution to the differential equa- 
tion is then the shortest distance between the two points. 
Alternatively, one can specify a geodesic with an initial 
point and direction. In this case, the geodesic is inter- 
preted as the path drawn by parallel transporting the 
tangent vector (also known as the curve's velocity). This 
second interpretation of geodesies will be the most use- 
ful for understanding the coordinates we are about to 
construct. The coordinates that we consider are polar- 
like coordinates, with N ~ 1 angular coordinates and one 
radial coordinate. 

If we consider all geodesies that pass through the best 
fit with a normalized velocity, v^v^ = 1, then each 
geodesic is identified by N — 1 free parameters, corre- 
sponding to direction of the velocity at the best fit. (The 
normalization of the velocity does not change the path 
of the geodesic - only the time it takes to traverse the 
path.) These N — 1 free parameters will be the angu- 
lar coordinates of the new coordinate system. There is 
no unique way of defining the angular coordinates. One 
can choose N orthonormal unit vectors at the best fit, 
and let the angular coordinates define a linear combi- 
nation of them. We typically choose eigendirections of 
the metric (the eigenpredictions of section [TT|) . Having 
specified a geodesic with the N — 1 angular coordinates, 
the radial coordinate represents the distance moved along 
the geodesic. Since we have chosen the velocity vector 
to be normalized to one, the radial component is the 
parametrization of the geodesic. 

We refer to these coordinates as extended geodesic 
coordinates and denote their Cartesian analog by 7 M . 
These coordinates have the special property that those 
geodesies that pass through the best fit appears as 
straight lines in parameter space. (It is impossible for 
all geodesies to be straight lines if the space is curved.) 

In general, one cannot express this coordinate change 
in an analytic form. The quadratic approximation to this 
transformation is given by 

Y « 6 v bf + ^86" + l -T^50 a 5ee. (25) 

The coordinates given in Eq. (|25jl are known as Riemann 
normal coordinates or geodesic coordinates. Within 
the general relativity community, these coordinates are 
known as locally inertial reference frames because they 
have the property that r" (a; = 0) = 0, that is, the 
Christoffel symbols vanish at the special point around 
which the coordinates are constructed 1351 . 




FIG. 11: a) Extended Geodesic Coordinates. The pa- 
rameters of a model are not usually well suited to describ- 
ing the behavior of a model. By considering the manifold 
induced in data space, one can construct more natural coor- 
dinates based on geodesic motion that are more well-suited 
to describing the behavior of a model (black grid). These 
coordinates remove all parameter-effects curvature and are 
known as extended geodesic coordinates. Note that we have 
moved the data point so that the best fit is not so near a 
boundary in this picture, b) Cost Contours in Extended 
Geodesic Coordinates. Although the summing exponen- 
tial model is nonlinear, that non-linearity does not translate 
into large extrinsic curvature. This type of non-linearity is 
known as parameter-effects curvature, which the geodesic co- 
ordinates remove. This is most dramatically illustrated by 
considering the contours of constant cost in geodesic coordi- 
nates. The contours are nearly circular all the way out to the 
fold line and the boundary where the rates are infinite. 

Let us now consider the shape of cost contours for our 
example model using extended geodesic coordinates. We 
can consider both the shape of the coordinate mesh on 
the manifold in data space, as well as the shape of the cost 
contours in parameter space. To illustrate the dramatic 
effect that these coordinates can have, we have adjusted 
the data so that the best fit does not lie so near the 
boundary. The results are in Fig. [IT] 

The extended geodesic coordinates were constructed to 
make the elongated ellipse that is characteristic of sloppy 
models become circular. It was hoped that by making the 
transformation nonlinear, it would straighten out the an- 
harmonic "banana" shape, rather than magnify it. It ap- 
pears that this wish has been granted spectacularly. Not 
only has the banana been straightened out within the re- 
gion of the long narrow canyon, but the entire region of 
parameter space, including the plateau, has been trans- 
formed into one manageable, isotropic basin. Indeed, the 
cost contours of Fig. [TTb are near-perfect circles, all the 
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way to the boundary where the rates go to zero, infinity, 
or are equal. 

To better understand how this elegant result comes 
about, let's consider how the cost changes as we move 
along a geodesic that passes through the best fit. The 
cost then becomes parametrized by the same parameter 
describing the geodesic, which we call r. The chain rule 
gives us, 

dr ~ dr d0» ^ 
where v 11 — 9^. Applying this twice to the cost gives: 

= vVg ^ + rm pN ndfidurn ^^ (26) 

The term v^v v g^u in Eq. (|26p is the arbitrarily chosen 
normalization of the velocity vector and is the same at 
all points along the geodesic. The interesting piece in 
Eq. (T25|) is the expression 

p n = s-j(j T jy 1 j t , 

which we recognize as the projection operator that 
projects out of the tangent space (or into the normal 
bundle) . 

Recognizing P N in Eq. (f2"o| , we see that any deviation 
of the quadratic behavior of the cost will be when the 
non-linearity forces the geodesic out of the tangent plane, 
which is to say that there is an extrinsic curvature. When 
there is no such curvature, then the cost will be isotropic 
and quadratic in the extended geodesic coordinates. 

If the model happens to have as many parameters as 
residuals, then the tangent space is exactly the embed- 
ding space and the model will be flat. This can be seen 
explicitly in the expression for P N , since J will be a 
square matrix if M = N, with a well-defined inverse: 

p n = 5~j(j T jy\j T 

= 0. 



manifold is typically much more flat that one would have 
guessed. 

Non-linearities that do not produce extrinsic curva- 
ture are described as parameter-effects curvature [l5| . As 
the name suggests these are "curvatures" that can be re- 
moved through a different choice of parameters. By us- 
ing geodesies, we have found a coordinate system on the 
manifold that removes all parameter-effects curvature at 
a point. It has been noted previously that geodesies are 
linked to zero parameter-effects curvature [46|. 

We believe it to be generally true for sloppy mod- 
els that non-linearities are manifested primarily as 
parameter-effects curvature as we argue in S22l | and in 
section I VIII We find similar results when we consider 
geodesic coordinates in the PC12 model, neural networks, 
and many other models. Just as for the summing expo- 
nential problem that produced Fig.fTTb. cost contours for 
this real-life model are nearly circular all the way to the 
model's boundary. 

Although the model manifold is much more flat than 
one would have guessed, how does that result compare 
for the model graph? We observed in section IIV1 that 
the model graph interpolates between the model mani- 
fold and the parameter space picture. If we find the cost 
contours for the model graph at various values of A, we 
can watch the cost contours interpolate between the cir- 
cles in Fig.lllb and the long canyon that is characteristic 
of parameter space. This can be seen clearly in Fig. [T3] 

With any set of coordinates, it is important to know 
what portion of the manifold they cover. Extended 
geodesic coordinates will only be defined in some region 
around the best fit. It is clear from Fig. QT]that for our 
example problem the region for which the coordinates are 
valid extends to the manifold boundaries. Certainly there 
are regions of the manifold that are inaccessible to the 
geodesic coordinates. Usually, extended geodesic coordi- 
nates will be limited by geodesies reaching the bound- 
aries, just as algorithms are similarly hindered in finding 
the best fit. 



VII. CURVATURE 



Furthermore, when there are as many parameters as 
residuals, the extended geodesic coordinates can be cho- 
sen to be the residuals themselves, and hence the cost 
contours will be concentric circles. 

In general, there will be more residuals than param- 
eters; however, we have seen in section IIIII that many 
of those residuals are interpolating points that do not 
supply much new information. Assuming that we can 
simply discard a few residuals, then we can "force" the 
model to be flat by restricting the embedding space. It is, 
therefore, likely that for most sloppy models, the man- 
ifold will naturally be much more flat than one would 
have expected. We will see when we discuss curvature in 
section IVIII that most of the non-linearities of a sloppy 
model do not produce extrinsic curvature, meaning the 



In this section, we discuss the various types of curva- 
ture that one might expect to encounter in a least-squares 
problem and the measures that could be used to quan- 
tify those curvatures. Curvature of the model manifold 
has had many interesting applications. It has been illus- 
trated by Bates and Watts that the curvature is a conve- 
nient measure of the non-linearity of a model [IH, [TH, [l8[ . 
When we discuss the implications of geometry on nu- 
merical algorithms this will be critical, since it is the 
non-linearity that makes these problems difficult. 

Curvature has also been used to study confidence re- 
gions fl6l . [20L [4T 49] . kurtosis (deviations from normality) 
in parameter estimation [50] , and criteria for determining 
if a minimum is the global minimizer [51) . We will see 
below that the large anisotropy in the metric produces 
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FIG. 12: By changing the value of the Levenberg-Marquardt parameter, the course of the geodesies on the corresponding model 
graph are deformed, in turn distorting the shape of the cost contours in the geodesic coordinates, a) A = is equivalent to 
the model manifold. The cost contours for a relatively flat manifold, such as that produced by the sum of two exponentials, 
are nearly perfect, concentric circles. The geodesies can be evaluated up to the boundary of the manifold, at which point 
the coordinates are no longer defined. Here we can clearly see the stiff, long manifold direction (vertical) and the sloppy, 
thin manifold direction (horizontal) b) Small A, (A much smaller than any of the eigenvalues of the metric) will produce cost 
contours that are still circular, but the manifold boundaries have been removed. In this case the fold line has disappeared, and 
cost contours that ended where parameters evaporated now stretch to infinity, c) Moderate A creates cost contours that begin 
to stretch in regions where the damping parameter significantly affects the eigenvalue structure of the metric. The deformed 
cost contours begin to take the plateau and canyon structures of the contours in parameter space, d) Large A effectively 
washes out the information from the model manifold metric, leaving just a multiple of the parameter space metric. In this case, 
the contours are those of parameter space - a long narrow curved canyon around the best fit. This figure analogous to Fig. []J>, 
although the model here is a more sloppy (and more realistic) example. 



a similar anisotropy in the curvature of sloppy models. 
Furthermore, we use curvature as a measure of how far 
an algorithm can accurately step (section IVH Fl) and to 
estimate how many parameters a best fit will typically 
evaporate (section [VII G[) . 

In our discussion of geodesic coordinates in section fVTl 
we saw how some of the non-linearity of a model could 



be removed by a clever choice of coordinates. We also ar- 
gued that the non-linearity that could not be removed by 
a coordinate change would be expressed as an extrinsic 
curvature on the expectation surface. Non-linearity that 
does not produce an extrinsic curvature is not irrelevant; 
it can still have strong influence on the model and can 
still limit the effectiveness of optimization algorithms. 
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Specifically, this type of non-linearity changes the way 
that distances are measured on the tangent space. They 
may cause the basis vectors on the tangent space to ex- 
pand, shrink, or rotate. We follow the nomenclature of 
Bates and Watts and refer to this type of non-linearity as 
parameter-effects curvature [l^,0|. We emphasize that 
this is not a "real" curvature in the sense that it does not 
cause the shape of the expectation surface to vary from a 
flat surface, but its effects on the behavior of the model 
is similar to the effect of real curvature. This "curvature" 
could be removed through a more convenient choice of 
coordinates, which is precisely what we have done by 
constructing geodesic coordinates in section fVTl A func- 
tional definition of parameter-effects curvature would be 
the non-linearities that are annihilated by operating with 
P N . Alternatively, one can think of the parameter-effects 
curvature as the curvatures of the coordinate mesh. We 
discuss parameter-effects curvature in section I VII CI 

Bates and Watts refer to all non-linearity that cannot 
be removed by changes of coordinates as intrinsic cur- 
vature [la ]. We will not follow this convention; instead, 
we follow the differential geometry community and fur- 
ther distinguish between intrinsic or Riemann curvature 
(section lVII Al) and extrinsic or embedding curvature (36| 
(section [VII Bp . The former refers to the curvature that 
could be measured on a surface without reference to the 
embedding. The latter refers to the curvature that arises 
due to the manner in which the model has been embed- 
ded. From a complete knowledge of the extrinsic cur- 
vature, one could also calculate the intrinsic curvature. 
Based on our discussion to this point, one would expect 
that both the intrinsic and the extrinsic curvature should 
be expressible in terms of some combination of P N and 
d^d v r m . This turns out to be the case, as we will shortly 
see. 

All types of curvature appear in least squares models, 
and we will now discuss each of them. 



A. Intrinsic (Riemann) Curvature 

The embedding plays a crucial role in nonlinear least 
squares fits - the residuals embed the model manifold 
explicitly in data space - we will be primarily interested 
in the extrinsic curvature. However, because most studies 
of differential geometry focus on the intrinsic curvature, 
we discuss it. 

The Riemann curvature tensor, Rp 7 g is one measure 
of intrinsic curvature. Since intrinsic curvature makes no 
reference to the embedding space, curvature is measured 
by moving a vector, V 11 , around infinitesimal closed loops 
and observing the change the curvature induces on the 
vector, which is expressed mathematically by 

This expression in turn can be written independently of 
V 1 * in terms of the Christoffel symbols and their deriva- 




FIG. 13: Intrinsic and Extrinsic Curvature. Intrinsic 
Curvature is inherent to the manifold and cannot be removed 
by an alternative embedding. A model that is the sum of 
two exponential terms has all types of curvature. This is the 
same model manifold as in Fig.[T};, viewed from an alternative 
angle to highlight the curvature. From this viewing angle, the 
extrinsic curvature becomes apparent. This is also an example 
of intrinsic curvature. 

tives by the standard formula 

u Pj6 — °1 L 138 ~ °S L /3 7 T 1 f)S L e 7 — 1 /3j L e<5- 

From this we can express Rp^g in terms of derivatives of 
the residuals. Even though Rp^g depends on derivatives 
of r, suggesting that it would require a third derivative 
of the residuals, one can in fact represent it in terms of 
second derivatives and P , 

Rap-yS = d a dyr m P£ n dpd s r n - d a dgr m P^ ln dpd 7 r n , 

which the Gauss-Codazzi equation extended to the case 
of more than one independent normal direction [37| . 

The toy model that we have used throughout this work 
to illustrate concepts has intrinsic curvature. The curva- 
ture becomes most apparent when viewed from another 
angle, as in Fig. [T3] 

Intrinsic or Riemann curvature is an important mathe- 
matical quantity that is described by a single, four-index 
tensor; however, we do not use intrinsic curvature to 
study optimization algorithms. Extrinsic and parameter- 
effects curvature in contrast not be simple tensors but will 
depend on a chosen direction. These curvatures are the 
key to understanding nonlinear least squares fitting. 

B. Extrinsic Curvature 

Extrinsic curvature is easier to visualize than intrin- 
sic curvature since it makes reference to the embedding 
space, which is where one naturally imagines curved sur- 
faces. It is important to understand that extrinsic and in- 
trinsic curvature are fundamentally different and are not 
merely different ways of describing the same concept. In 
differentiating between intrinsic and extrinsic curvature, 
the simplest illustrative example is a cylinder, which has 
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FIG. 14: A ruled surface has no intrinsic curvature; how- 
ever, it may have extrinsic curvature. The model manifold 
formed from a single exponential rate and amplitude is an 
example of a ruled surface. This model could be isometrically 
embedded in another space to remove the curvature. 

no intrinsic curvature but does have extrinsic curvature. 
One could imagine taking a piece of paper, clearly a flat, 
two dimensional surface embedded in three dimensional 
space, and roll it into a cylinder. Rolling the paper does 
not affect distances on the surface, preserving its intrin- 
sic properties, but changes the way that it is embedded 
in three dimensional space. The rolled paper remains in- 
trinsically flat, but it now has an extrinsic curvature. A 
surface whose extrinsic curvature can be removed by an 
alternative, isometric embedding is known as a ruled sur- 
face [52]. While an extrinsic curvature does not always 
imply the existence of an intrinsic curvature, an intrinsic 
curvature requires that there also be extrinsic curvature. 
Our toy model, therefore, also exhibits extrinsic curva- 
ture as in Fig. [T3J One model whose manifold is a ruled 
surface is given by a two parameter model which varies 
an exponential rate and an amplitude: 



The manifold for this model with three data points is 
drawn in Fig.[H{7l|- 

There are two measures of extrinsic curvature that we 
discuss. The first is known as geodesic curvature as it 
measures the deviation of a geodesic from a straight line 
in the embedding space. The second measure is known 
as the shape operator. These two measures are compli- 
mentary, and should be used together to understand the 
way a space is curved. Both geodesic curvature and the 
shape operator have analogous measures of parameter- 
effects curvature that will allow us to compare the rela- 
tive importance of the two types of curvature. 

Measures of extrinsic and parameter effects curvature 
to quantify non-linearities have been proposed previously 
by Bates and Watts [15l Il7l lla|. Although the measure 
they use is equivalent to the presentation of the next few 
sections, their approach is different. The goal of this 
section is to express curvature measures of non-linearity 
in a more standard way using the language of differential 
geometry. By so doing, we hope to make the results 




FIG. 15: Geodesic Curvature. A direction on a curved 
surface define a geodesic. The deviation of the geodesic from 
a straight line in the embedding space is measured by the 
geodesic curvature. It is the inverse radius of the circle fit to 
the geodesic path at the point. 

accessible to a larger audience. 

1. Geodesic Curvature 

Consider a geodesic parametrized by t, tracing a path 
through parameter space, 9^{t), which in turn defines a 
path through residual space, f(9(r)). The parametriza- 
tion allows us to discuss the velocity, v = 4^, and the 
acceleration, a = ^ . A little calculus puts these expres- 
sions in a more practical form: 

a = e fi e v p N d fJ ,d u f. 

Notice that the normal projection operator emerges nat- 
urally in the expression for a. 

For any curve that has instantaneous velocity and ac- 
celeration vectors, one can find a circle that local approx- 
imates the path. The circle has radius 




and a corresponding curvature 

v 

Because the path that we are considering is a geodesic, 
it will be as near a straight line in data space as possible 
without leaving the expectation surface. That is to say, 
the curvature of the geodesic path will be a measure of 
how the surface is curving within the embedding space, 
i.e. an extrinsic curvature. The curvature associated with 
a geodesic path is illustrated in Fig. [T51 

In our previous discussion of geodesies, we saw that 
a geodesic is fully specified by a point and a direction. 
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Therefore we can define the geodesic curvature of any 
point on the surface, corresponding to a direction, v 1 *, by 



K{v) 



\v^v v P N d ll d v r\ 



v"v a 



(27) 



At each point an the surface, there is a different value of 
the geodesic curvature for each direction on the surface. 
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Another measure of extrinsic curvature, complimen- 
tary to the geodesic curvature, is the shape operator, 
Sfj, u - While the geodesic curvature requires us to choose 
an arbitrary direction on the surface, the shape operator 
requires us to choose an arbitrary direction normal to the 
surface. 

To understand the shape operator, let us first consider 
the special case of an iV-dimensional surface embedded 
in an N + 1-dimensional space. If this is the case, then at 
any point on the surface there is a unique (up to a sign) 
unit vector normal to the surface, n. If this is the case, 
is given by 



(28) 



S^v is known as the shape operator because it describes 
how the surface is shaped around the unit normal, n. It 
is a symmetric, covariant rank-2 tensor. We are usually 
interested in finding the eigenvalues of the shape operator 
with a single raised index: 

= S av . 

The eigenvectors of S£ are known as the principal curva- 
ture directions, and the eigenvalues are the extrinsic cur- 
vatures in those directions. In the case that there is only 
one direction normal to the surface, then the (absolute 
value of the) eigenvalues of are equal to the geodesic 
curvatures in the respective eigendirections. The eigen- 
values, {k^}, may be either positive or negative. Positive 
values indicate that the curvature is toward the direction 
of the normal, while negative values indicate that it is 
curving away, as illustrated in Fig. 1161 

In general, there will not be an unique normal vec- 
tor. If an iV-dimensional surface is embedded in an M- 
dimensional space, then there will M — N independent 
shape operators, and one is left to perform an eigenvalue 
analysis for each as described above [36j. Fortunately, 
for the case of a least squares problem, there is a natural 
direction to choose: the normal component of the unfit 
data, —P N r, making the shape operator 
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where we introduce the minus as convention. In gen- 
eral, around an arbitrary vector V, the shape operator 



FIG. 16: Shape Operator. Specifying a direction normal 
to a curved surface, n, defines a shape operator. The eigen- 
values of the shape operator are the principle curvatures and 
the corresponding eigenvectors are the directions of principle 
curvature. 



becomes 
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It should now be clear why these two measures of ex- 
trinsic curvature (geodesic curvature and the shape op- 
erator) are complimentary. The geodesic curvature is 
limited by having to choose a direction tangent to the 
surface, but gives complete information about how that 
direction is curving into the space normal to the surface. 
In contrast, the shape operator gives information about 
all the directions on the surface, but only tells how those 
directions curve relative to a single normal direction. 



C. Parameter-effects Curvature 

We are now prepared to discuss parameter-effects cur- 
vature. We repeat that parameter-effects curvature is 
not a curvature of the manifold. Instead, it is a mea- 
sure of the curvatures of the coordinate mesh on the 
surface. In our experience, parameter-effects curvature 
is typically the largest of the three types we have dis- 
cussed. By its very nature, this curvature depends on 
the choice of the parametrization. By constructing ex- 
tended geodesic coordinates in section IVI1 we were able 
to remove the parameter-effects curvature from the model 
(at a point). In this section we will discuss how to mea- 
sure the parameter-effects curvature and compare it to 
the other curvatures that we discussed above. 

To understand the meaning of parameter-effects cur- 
vature, let us begin by considering a linear model with 
no curvature of any type. For simplicity, we consider the 
parametrization of the xy-plane given by 

x = £0i + 6 2 
y = 6i + e6 2 . 

This parametrization will produce a skewed grid as e —¥ 1, 
characteristic of linear sloppy models, such as fitting 
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polynomials. This grid is illustrated in Fig. [TTr for 
e = 1/2. By reparametrizing the linear model, we can in- 
troduce parameter-effects curvature. For example, if we 
replace the parameters with their squares (which may be 
useful if we wish to enforce the positivity of the parame- 
ters' effects) 



y 



ei + ee. 



then the corresponding coordinate mesh will become 
compressed and stretched, as seen in Fig. 117b . Alter- 
natively, if we reparametrize the model as 

x2 



y 



(ef + ee 2 2 ) 



2 r 

2\ 2 



in order to limit the region of consideration to the 
upper-right quarter plane, then the coordinate mesh will 
stretch and rotate into itself, depicted in Fig. [17c . With 
more than two parameters, there is additionally a tor- 
sion parameter-effects curvature in which the lines twist 
around one another. None of these reparametrization 
change the intrinsic or extrinsic properties of the model 
manifold; they merely change how the coordinates de- 
scribe the manifold. The extent to which coordinate 
mesh is nonlinear is measured by the parameter-effects 
curvature. 

We now consider how to quantify parameter-effects 
curvature. We have discussed the normal and tangential 
projection operators, P N and P T , and argued that the 
normal projection operator would extract the extrinsic 
and intrinsic curvature from the matrix of second deriva- 
tives. Looking back on our expressions for curvature up 
to this point, we see that each involves P N . The com- 
plimentary parameter-effects curvature can be found by 
replacing P N with P T in each expression. Thus, in anal- 
ogy with Eq. (f2"7| . we can define the parameter-effects 
geodesic curvature by 

v a v a 



Likewise, we can define a parameter-effects shape opera- 
tor by comparison with Eq. (f2"5)) , 
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Recall that for an TV-dimensional space embedded in 
an M-dimensional space, there are M — N independent 
shape operators. This is because the space normal to 
the tangent space (into which we are projecting the non- 
linearity) is of dimension M — N . The parameter-effects 
analog must therefore have N independent shape oper- 
ators, since the projection space (the tangent space) is 
TV-dimensional. Therefore, we are naturally led to define 
a parameter-effects shape-operator with an additional in- 
dex to distinguish among the N possible tangent direc- 
tions, 



If we resolve these shape operators into the natural basis 
on the tangent space, = SP"d a r m , we find 

Therefore, the parameter-effects curvature is correctly in- 
terpreted as the connection coefficients. With this un- 
derstanding, it is clear that geodesic coordinates remove 
parameter-effects curvature, since they are the coordi- 
nates constructed to give T = 0. 

Finally, we note that from a complete knowledge of all 
the curvatures (for all directions) one can determine the 
matrix of second derivatives completely. Although we do 
not demonstrate this here, we note it is a consequence of 
having a flat embedding space. 



D. Curvature in Sloppy Models 

Based on our analysis thus far, we should have two 
expectations regarding the curvature of sloppy mod- 
els. First, because of the large spread of eigenvalues of 
the metric tensor, unit distances measured in parame- 
ter space correspond to large ranges of distances in data 
space. Conversely, one has to move the parameters by 
large amounts in a sloppy direction in order to change 
the residuals by a significant amount. Because of this, we 
expect that the anharmonicities in the sloppy directions 
will become magnified when we consider the curvature 
in those directions. We expect strong anisotropics in the 
curvatures of sloppy models, with the largest curvatures 
corresponding to the sloppiest directions. 

Secondly, as we saw in section IVI1 by changing coor- 
dinates to extended geodesic coordinates, we discovered 
that the manifold generated by our sloppy model was sur- 
prisingly flat, i.e. had low intrinsic curvature. We have 
seen that if the model happens to have equal number of 
data points as parameters, then the model will always be 
flat. Since many of the data points in a typical sloppy 
model are just interpolation points, we believe that in 
general sloppy models have lower extrinsic curvature than 
one would have naively guessed just by considering the 
magnitude of the non-linearities. This explains perhaps 
why we will find that the dominant curvature of sloppy 
models is the parameter-effects one. 

We can better understand the size of the various curva- 
tures by considering the interpretation presented in sec- 
tion IIIII that sloppy models are a generalized interpola- 
tion scheme. If we choose N independent data points as 
our parametrization, then the interpolating polynomial, 
fjV-i(i) in Eq. (|13p is a linear function of the parame- 
ters. As discussed below that equation, the manifold in 
each additional direction will be constrained to within 
e — SfN+i of P/v_i(i). Presuming that this deviation 
from flatness smoothly varies along the jth largest width 
Wj ~ 8fj of the manifold (i.e., there is no complex or sen- 
sitive dependence on parameters), the geodesic extrinsic 
curvature is 



P T 8 f) r 



K = e/Wf 



(32) 
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FIG. 17: a) Linear Grid. A sloppy linear model may have a skewed coordinate grid, but the shape of the grid is constant, 
having no parameter effects curvature, b) Compressed Grid. By reparametrizing the model, the grid may become stretched 
or compressed in regions of the manifold, c) Rotating, Compressed Grid. Another parametrization may not only stretch 
the grid, but also cause the coordinates to rotate. Parameter-effects curvature describes the degree to which the coordinates are 
stretching and rotating on the manifold. With more than two parameters, there is also a torsion parameter-effects curvature 
(twisting). 



predicting a range of extrinsic curvatures comparable to 
the range of inverse eigenvalues of the metric. Further- 
more, the ratio of the curvature to the inverse width 
should then be e/Wj ~ 5f N+l /5fj ~ (St/R) N+1 -\ 
where St is the spacing of time points at which the model 
is sampled and R is the time scale over which the model 



changes appreciably (see the argument in section Mil fol- 
lowing Eq. (fHijO . 

Since we estimate e = Sf^+i to be the most narrow 
width if the model had an additional parameter, we can 
find the overall scale of the extrinsic curvature to be given 
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by the narrowest width 
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Additionally, we can find the scale set by the parame- 
ter effects curvature by recalling that parameter effects 
curvature is the curvature of the coordinate mesh. If 
we ignore all parameter combinations except the stiffest, 
then motion in this direction traces out a one-dimensional 
model manifold. The parameter-effects curvature of the 
full model manifold in the stiffest direction now corre- 
sponds to the extrinsic curvature of this one-dimensional 
manifold [z3|, and as such is set by the smallest width 
(which in this case in the only width), i.e. the longest 
width of the full model manifold. The similar structure 
of parameter-effects curvature and extrinsic curvature, 
Eqs. ([27]) and (f3l"j) . suggest that the parameter-effects 
curvature also be proportional to the inverse eigenvalues 
(squares of the widths) along the several cross sections. 
Combining these result, we see that in general the ratio 
of extrinsic to parameter-effects curvature to be given by 
ratio of the widest to the most narrow width. 



(33) 



In our experience the ratio of extrinsic to parameter- 
effects curvature in Eq. (|33l) is always very small. When 
Bates and Watts introduced parameter-effects curvature, 
they considered its magnitude on twenty four models and 
found it universally larger than the extrinsic curvature, 
often much larger [15] . We have here offered an expla- 
nation of this effect based on the assumption that the 
deviation from flatness is given by Eq. (|3"2"|) . 

We explicitly check the assumption of Eq. (p?2"j) by cal- 
culating cross sections for a model of several exponentials 
and for an artificial neural network. We have already seen 
in section Mil in figure [6] that these widths span several 
orders of magnitude as predicted by the singular values 
of the Jacobian. In Fig. [T5]we view the data space image 
of these widths (projected into the plane spanned by the 
local velocity and acceleration vectors) , where we see ex- 
plicitly that the deviation from flatness is similar for all 
the cross sections. In Fig. [TO] we see that that the extrin- 
sic curvature is comparable to the narrowest cross section 
and the parameter-effects curvature is comparable to the 
widest cross section as we argued above, both for fitting 
exponentials and for the neural network model. 

We further illustrate the above analysis by explicitly 
calculating the curvatures for the sloppy model formed 
by summing several exponential terms with amplitudes. 
Fig. [20] is a log-plot illustrating the eigenvalues of the 
inverse metric, the geodesic curvatures in each of those 
eigendirections, as well as the parameter-effects geodesic 
curvature in each of those directions. We see the same 
picture whether we consider the eigenvalues of the shape 
operator or the geodesic curvature. Both measures of cur- 
vature are strongly anisotropic with both extrinsic cur- 
vature and parameter-effects curvature covering as many 




FIG. 18: a) Cross sections of a summing exponential 

model projected into the plane spanned by the velocity and 
acceleration vectors in data space at an arbitrary point near 
the center. Notice the widths of successive cross sections are 
progressively more narrow, while the deviations from flatness 
are uniformly spread across the width. The magnitude of the 
deviation from flatness is approximately the same for each 
width, giving rise to the hierarchy of curvatures, b) Cross 
sections of a feed forward neural network has many 
of the same properties as the exponential model. In both 
cases, the curvature is much smaller than it appears due to the 
relative scale of the two axes. In fact, the sloppiest directions 
(narrowest widths) have an aspect ratio of about one. 



orders of magnitude as the eigenvalues of the (inverse) 
metric. However, the extrinsic curvature is smaller by a 
factor roughly given by Eq. ([3"3"|). We will use this large 
discrepancy between extrinsic and parameter-effects cur- 
vature when we improve the standard algorithms in sec- 
tion ELLE1 

We have seen that manifolds of sloppy models possess 
a number of universal characteristics. We saw in sec- 
tion lllll that they are bounded with a hierarchy of widths 
which we describe as a hyper-ribbon. In this section we 
have seen that the extrinsic and parameter-effects cur- 
vature also possess a universal structure summarized in 
Figs. [18l[2"l"l A remarkable thing about the parameter- 
invariant, global structure of a sloppy model manifold 
is that is typically well-described by the singular values 
of the parameter-dependent, local Jacobian matrix. We 
saw in section Mil that the singular values correspond to 
the widths. We have now argued that the largest and 
smallest singular values set the scale of the parameter- 
effects and extrinsic curvatures respectively. This entire 
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FIG. 19: The extrinsic and parameter-effects curvature on the 
model manifold are strongly anisotropic, with the largest cur- 
vatures along the shortest widths (see Figs.[6l [T8)) . The slopes 
of the (inverse) curvature vs. eigenvalue lines are roughly 
twice that of the singular values (which are equivalent to the 
widths). The magnitude of the extrinsic curvature is set by 
the most narrow cross sections, while the magnitude of the 
parameter-effects curvature is set by the widest cross-section. 
Consequently the parameter-effect curvature is much larger 
than the extrinsic curvature. Here we plot the widths and 
curvatures for a model of four exponentials (above) from ref- 
erence [22| and a feed forward artificial neural network (below) 



structure is a consequence of the observation that most 
models are a multi-dimensional interpolation scheme. 

Let us summarize our conclusions about the geometry 
of sloppy models. We argued in section [TTT] using inter- 
polation theorems that multiparameter nonlinear least- 
squares models should have model manifolds with a hi- 
erarchy of widths, forming a hyper-ribbon with the n th 
width of order W n ~ WoA™ with A given by the spacing 
between data points divided by a radius of convergence 
(in some multidimensional sense) and Wo the widest cross 
section. We discovered in some cases that the eigenval- 
ues of the Hessian about the best fit agreed well with 
the squares of these widths (so A„ ~ A 2n , see Fig. [5]). 
This depends on the choice of parameters and the place- 
ment of the best fit; we conjecture that this will usually 
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FIG. 20: Curvature Anisotropy. a) Inverse Met- 
ric eigenvalues. The (inverse) metric has eigenvalues 
spread over several orders of magnitude, producing a strong 
anisotropy in the way distances are measured on the model 
manifold, b) Geodesic Curvature in eigendirections 
of the metric. The geodesic curvatures also cover many 
decades. The shortened distance measurements from the 
metric eigenvalues magnify the anharmonicities in the sloppy 
directions, c) Parameter-Effects Geodesic Curvature. 
The parameter-effects curvature is much larger than the ex- 
trinsic curvature, but shares the anisotropy. d) The eigen- 
values of the Shape Operator. The strong curvature 
anisotropy described by the geodesic curvature is also illus- 
trated in the eigenvalue spectrum of the shape operator, e) 
Parameter-Effects Shape Operator eigenvalues. Two 
measures (geodesic and shape operator curvatures) span sim- 
ilar ranges, but in both cases the parameter-effects curvature 
is a factor of about 10 larger than the extrinsic curvature 
equivalent. 



occur if the 'bare' parameters are physically or biologi- 
cally natural descriptions of the model and have natu- 
ral units (i.e., dimensionless), and if the best fit is not 
near the boundary of the model manifold. The parame- 
ter A will depend on the model and the data being fit; 
it varies (for example) from 0.1 to 0.9 among seventeen 
systems biology models Q. We argued using interpola- 
tion theory that the extrinsic curvatures should scale as 
K n ~ e/W 2 , where the total variation e ~ Wn, implying 
K n ~ A N /(W A 2n ) (Fig. 18c). We find this hierarchy 
both measured along the eigenvectors of the (parameter- 
independent) shape operator (Fig. [20} or the geodesic 
curvatures measured along the (parameter-dependent) 
eigenpredictions at the best fit. Finally, we note that 
the parameter effects curvature also scales as 1/A 2n by 
inspecting the similarity in the two formulae, Eqs. (f27|) 
and (|31l) . We argue that the parameter-effects curva- 
ture should be roughly given by the extrinsic curva- 
ture of a one-dimensional model moving in a stiff di- 
rection, which sets the scale of the parameter effects as 
K% ~ WojWl ~ l/(W A 2n ), again either measured 
along the eigendirections of the parameter-effects shape 
operator or along eigenpredictions. Thus the entire struc- 
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FIG. 21: A caricature of the widths and curvatures of a typ- 
ical sloppy model, a) The manifold deviates by an amount 
A^ from a linear model for each width. As each width is 
smaller than the last by a factor of A the curvature is largest 
along the narrow widths. This summary agrees well with the 
two real models in Fig. 1181 b) The scales of the extrinsic 
and parameter-effects curvature are set by the narrowest and 
widest widths respectively. The parameter-effects curvature 
is therefore smaller than the extrinsic curvature by a factor 
of A^. Both are strongly anisotropic. Compare this figure 
to with the corresponding result for the two real models in 
Fig. [m 

ture of the manifold can be summarized by three num- 
bers, Wo the stiffest width, A the typical spacing between 
widths, and N the number of parameters. We summarize 
our conclusions in Fig. [2~T1 



E. Curvature on the Model Graph 

Most of the non-linearities of sloppy models appear as 
parameter-effects curvature on the model manifold. On 
the model graph, however, these non-linearities become 
extrinsic curvature because the model graph emphasizes 
the parameter dependence. An extreme version of this 
effect can be seen explicitly in Fig. [5J where the model 
manifold, which had been folded in half, is unfolded in 
the model graph, producing a region of high curvature 



FIG. 22: Model Graph Curvature. As the Levenberg- 
Marquardt parameter, A, is increased, directions with highest 
curvature become less curved. For stiff directions with less 
extrinsic curvature, the parameter effects curvature may be 
transformed into extrinsic curvature. The damping term re- 
duces the large anisotropy in the curvature. For sufficiently 
large values of the Levenberg-Marquardt parameters, all cur- 
vatures vanish. 

around the fold line. 

If the Levenberg-Marquardt parameter is sufficiently 
large, the graph can be made arbitrarily flat (assuming 
the metric chosen for parameter space is flat, such as 
for the Levenberg metric). This effect is also visible in 
Fig- HI iii the regions that stretch toward the boundaries. 
In these regions, the Levenberg-Marquardt parameter is 
much larger than the eigenvalues of the metric, making 
the parameter space metric the dominant contribution, 
and creating an extrinsically flat region on the model 
graph. 

To illustrate how the curvature on the model graph 
is affected by the Levenberg-Marquardt parameter, we 
consider how the geodesic curvatures in the eigendirec- 
tions of the metric change as the parameter is increased 
for a model involving several exponentials with ampli- 
tudes and rates. The results are plotted in Fig. [2H As 
the Levenberg-Marquardt parameter is raised, the widely 
ranging values of the geodesic curvatures may either in- 
crease or decrease. The largest curvature directions (the 
sloppy directions) tend to flatten, but the directions with 
the lowest curvature (the stiff directions) direction be- 
come more curved. The main effect of the the Levenberg- 
Marquardt parameter is to decrease the anisotropy in the 
curvature. 

The behavior of the extrinsic curvature as the 
Levenberg-Marquardt parameter is varied can best be 
understood in terms of the interplay between parameter- 
effects curvature and extrinsic curvature. Curvatures 
decrease as more weight is given to the Euclidean, 
parameter-space metric. However, as long as the 
parameter-space metric is not completely dominant, the 
graph will inherit curvatures from the model manifold. 



27 



Since the graph considers model output versus the pa- 
rameters, curvature that had previously been parameter- 
effects become extrinsic curvature. Therefore, directions 
that had previously been extrinsically flat will be more 
curved, while the directions with the most curvature will 
become less curved. 

The largest curvatures typically correspond to the 
sloppy directions. Most algorithms will try to step in 
sloppy directions in order to follow the canyon. The ben- 
efit of the model graph is that it reduces the curvature 
in the sloppy directions, which allows algorithms to take 
larger steps. The fact that previously flat directions be- 
come extrinsically curved on the model graph does not 
hinder an algorithm that does not step in these extrin- 
sically flat directions anyway. The role that curvatures 
play in determining an algorithm's maximal step size is 
looked at more closely in the next section. 

F. Optimization Curvature 

The distinction between extrinsic and parameter- 
effects curvature is not particularly useful in understand- 
ing the limitations of an algorithm. An iterative algo- 
rithm taking steps based on a local linearization will ul- 
timately be limited by all non-linearities, both extrin- 
sic and parameter-effects. We would like a measure of 
non-linearity, analogous to curvature, that explains the 
limitations of stepping in a given direction. 

Suppose an algorithm proposes a step in some 
direction, w M , then the natural measure of non- 
linearity should include the directional second derivative, 
v^v v d ll d v r/v a v cn where we included the normalization in 
order to remove the scale dependence of v. This expres- 
sion is very similar to the geodesic curvature without the 
projection operator. 

Simply using the magnitude of this expression is not 
particularly useful because it doesn't indicate whether 
curvature of the path is improving or hindering the con- 
vergence of the algorithm. This crucial bit of information 
is given by the (negative) dot product with the unit resid- 
ual vector, 



which we refer to as the Optimization Curvature. Since 
the goal is to reduce the size of the current residual, the 
negative sign is to produce the convention that for n > 0, 
the curvature is helping the algorithm while when n < 
the curvature is slowing the algorithm's convergence. 

This expression for k has many of the properties of 
the curvatures discussed in this section. It has the same 
units as the curvatures we have discussed. It requires 
the specification of both a direction on the manifold (the 
proposed step direction, v) and a direction in data space 
(the desired destination, r), making it a combination of 
both the geodesic and shape operator measures of curva- 
ture. Furthermore, without the projection operators, it 



combines both extrinsic and parameter effects curvature 
into a single measure of non-linearity, although in prac- 
tice, it is dominated by the parameter-effects curvature. 
We now consider how k is related to the allowed step size 
of an iterative algorithm. 

Consider the scaled Levenberg step given by 

<5r = - (g° + XD)^ d u C St. 

Each A specifies a direction for a proposed step. For a 
given A, we vary St to find how far an algorithm could 
step in the proposed direction. We determine St by per- 
forming a line search to minimize the cost in the given 
direction. While minimizing the cost at each step may 
seem like a natural stepping criterion, it is actually a poor 
choice, as we discuss in section lVlII Ct however, this sim- 
ple criteria is useful for illustrating the limitations on step 
size. 

We measure the step size by the motion it causes in 
the residuals, ||5r||. This is a convenient choice because 
each direction also determines a value for the geodesic 
curvature (K), parameter-effects curvature (K p ), and an 
optimization curvature (k), each of which are measured 
in units of inverse distance in data space. We compare 
the step size with the inverse curvature in each direction 
in Fig. [H 

One might assume that the size of the non-linearities 
always limits the step size, since the direction was deter- 
mined based on a linearization of the residuals. This is 
clearly the case for the summing exponentials model in 
Fig. I2"5h , where k < 0; the step size closely follows the 
largest of the curvatures, the parameter effects curvature 

K P K\K\. 

However, the non-linearities on occasion may inadver- 
tently be helpful to an algorithm, as in Fig. I2l!b where 
k > 0. If the value of n changes sign as we vary A, 
then the distinction becomes clear: steps can be several 
orders of magnitude larger than expected if k > 0, oth- 
erwise they are limited by the magnitude of k. The sign 
of the parameter k is illustrating something that can be 
easily understood by considering the cost contours in pa- 
rameter space, as in Fig. |2"5H . If the canyon is curving 
"into" the proposed step direction, then the step runs up 
the canyon wall and must be shortened. However, if the 
canyon is curving "away" from the proposed step direc- 
tion, then step runs down the canyon and eventually up 
the opposite wall, resulting in a much larger step size. 

G. Curvature and parameter evaporation 

We have stressed the the boundaries of the model man- 
ifold are the major obstacle to optimization algorithms. 
Because a typical sloppy model has many very narrow 
widths, it is reasonable to expect the best fit parame- 
ters to have several evaporated parameter values when 
fit to noisy data. In order estimate the expected num- 
ber of evaporated parameters, however, it is necessary to 
account for the extrinsic curvature of a model. 
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FIG. 23: a) Curvature and Step Size for re < 0. If re < 0, then the non-linearities in the proposed direction are diverting 
the algorithm away from the desired path. Distances are limited by the size of the curvature, b) Curvature and Step Size 
for re > 0. The non-linearities may be helpful to an algorithm, allowing for larger than expected step sizes when re > 0. c) 
Curvature and Step Size for re with alternating sign. For small A, re < and the non-linearities are restricting the step 
size. However, if re becomes positive (the cusp indicates the change of sign), the possible step size suddenly increases, d) Cost 
contours for positive and negative values of re. One can understand the two different signs of re in terms of which side 
of the canyon the given point resides. The upper point has positive re and can step much larger distances in the Gauss-Newton 
direction than can the lower point with negative re, which quickly runs up the canyon wall. 



In Fig.[M]we illustrate how the curvature effects which 
regions of data space correspond to a best fit with either 
evaporated or finite parameters. A first approximation 
is a cross-sectional width with no extrinsic curvature, as 
in Fig. [24k . If the component of the data parallel to the 
cross-section does not lie outside the range of the width, 
the parameter will not evaporate. If the cross-section has 
curvature, however, the situation is more complicated, 
with the best fit depending on the component of the data 
perpendicular to the cross-section as well. Figs. l2"4Tb) 
and (c) highlight the regions of data space for which the 
best fit will not evaporate parameters (gray). 

Knowing both the regions of data space corresponding 
to non-evaporated parameters and the relative probabil- 
ities of the possible data (Eq. ©), we can estimate the 



expected number of evaporated parameters for a given a 
model at the best fit. Using Gaussian data of width a 
centered on the middle of a cross-section for a problem 
of fitting exponentials, we find the best fit and count the 
number zero-eigenvalues of the metric, corresponding to 
the number of non-evaporated parameters at the fit. 

We can derive analytic estimates for the number of 
evaporated parameters using the approximation that the 
cross section is either flat or has constant curvature as 
in Fig. UMb and b. If the cross-section is extrinsically 
flat then the probability of the corresponding parameter 
combination not evaporating is given in terms of the error 
function 

= 2 erf (^j , (35) 
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FIG. 24: The curvature along the width of a manifold effects if 
the best fit lies on the boundary or on the interior. For a cross- 
sectional width (thick black line), consider three possibilities: 
a) extrinsically flat, b) constant curvature along width, and c) 
curvature proportional to distance from the boundary. Grey 
regions correspond to data points with best fits on the interior 
of the manifold, while white regions correspond to data with 
evaporated parameters. If the curvature is larger near the 
boundaries, there is less data space available for evaporated 
best fit parameters. 



where W n is the n th width given by W n = Wo A". 

A similar formula for the constant curvature approxi- 
mation is a little more complicated. It involves integrat- 
ing the Gaussian centered on the cross section in Fig. RM1 
over the gray region. Since the apex of the gray cone is 
offset from the center of the Gaussian, we evaluate the in- 
tegral treating the offset as a perturbation. We recognize 
that there are several cases to be considered. If the noise 
is smaller than any of the widths, then the probability is 
approximately one. However, if the noise is larger than 
the width but smaller than inverse curvature, the prob- 
ability is given by W n /a. Finally, if the noise is larger 
than any of the widths the probability is W n K n . Recall 
that we characterize a sloppy model manifold by three 
numbers, Wo, A, and N, the largest width, the average 
spacing between widths and the number of parameters 
respectively. The final result in each of the three cases in 
terms of these three numbers is given by 

fl ifa<W n , 

pcurvedJw^ HW n <*<l/K n , (36) 

[A"-" i£l/K n <a. 

From our caricature of a typical sloppy model summa- 
rized in Fig. 12 1 1 we estimate how many widths should 
belong in each category for a given a. Summing the 
probabilities for the several widths in Eq. (|36[) we find 
the expected number of non-evaporated parameters to 
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TABLE I: The number of non-evaporated parameters (N) per 
total number of parameters N at the best fit, for an 8 parame- 
ter model of exponentials and amplitudes. As the noise of the 
data ensemble grows, the number of non-evaporated parame- 
ters at the best fit decreases (i.e. more parameters are evapo- 
rated by a good fit). Even if the noise is much larger than any 
of the widths, there are still several non-evaporated parame- 
ters, due to the curvature (see Fig. 1241) . We estimate the ex- 
pected number of non-evaporated parameters from both a flat 
manifold approximation (Eq. (|35[0 and a constant curvature 
approximation. For the constant curvature approximation we 
show the result of the exact integral of the gaussian over the 
grey region of Fig. 124b as well as our perturbative approxi- 
mation, Eq. (|37|l . using the parameters Wo = 6.1, A = .11 
and N = 8. These approximations agree with the numerical 
results when the noise is small, but for very noisy data there 
are still several non-evaporated parameters even if the noise is 
much larger than any of the widths. Therefore, although our 
general caricature of the model manifold as a hyper-cylinder 
of constant curvatures and widths seems to describe the geom- 
etry of the sloppy directions, it does not capture the features 
of the stiff directions. This discrepancy could be due, for ex- 
ample, by an increase in the curvature near the boundary as 
in Fig. I2lt . 

be given by 

v _ 2 log a I Wq 
(iV approx ) - l _ A + 1. 

In Table U we compare the fraction of non-evaporated 
parameters with the estimates from Eqs. (1351) and (1361) . 
We find a large discrepancy when the noise in the data 
is very large. In this case there is often a large fraction 
of non-evaporated parameters even if the noise is much 
larger than any cross-sectional width. We attribute this 
discrepancy to larger curvatures near the corners of the 
manifold that increase the fraction of data space that can 
be fit without evaporating parameters. Since the metric 
is nearly singular close to a boundary, we expect the ex- 
trinsic curvature to become singular also by inspecting 
Eq. (12~T|) . We explicitly calculate the curvature near the 
boundary and we find that this is in fact the case. 

The calculation in Table fl] can be interpreted in sev- 
eral ways. If one is developing a model to describe some 
data with known error bars, the calculation can be used 
to estimate the number of parameters the model could 
reasonably have without evaporating any at the best fit. 
Alternatively for a fixed model, the calculation indicates 
what level of accuracy is necessary in the data to confi- 
dently predict which parameters are not infinite. Qual- 
itatively, for a given model, the errors must be smaller 
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than the narrowest width for there to be no evaporated 
parameters. 

Similarly, for experimental data with noise less than 
any of the (inverse) parameter-effects curvatures the pa- 
rameter uncertainties estimated by the inverse Fisher in- 
formation matrix will be accurate since the parameter- 
ization is constant over the range of uncertainty. It is 
important to note, that for models with large numbers of 
parameters either of these conditions require extremely 
small, often unrealistically small, error bars. In general, 
it is more practical to focus on predictions made by en- 
sembles of parameters with good fits rather than parame- 
ter values at the best fit as the latter will depend strongly 
the noise in the data. 



VIII. APPLICATIONS TO ALGORITHMIC^ 



we consider possible improvements to algorithms, we will 
usually be willing to accept a few more function calls if 
it can significantly reduce the number of Jacobian eval- 
uations that an algorithm requires. 

In the next few sections, we discuss the geometric 
meaning of the Gauss- Newton method (section IVIII Ap 
and other similar algorithms, such as the Levenberg- 
Marquardt algorithm ( section IVIII B[) . We then discuss 
how ideas from differential geometry can lead to ways of 
improving convergence rates. First, we suggest a method 
of updating the Levenberg-Marquardt parameter, which 
we call delayed gratification, in section IVIII CI Second, 
we suggest the inclusion of a geodesic acceleration term 
in section IVIII Dl We end the discussion by comparing 
the efficiency of standard versions of algorithms to those 
with the suggested improvements in section IVIII El 



We now consider how the results derived in previous 
sections can be applied to algorithms. We have stressed 
that fitting sloppy models to data consist of two difficult 
steps. The first step is to explore the large, flat plateau to 
find the canyon. The second step is to follow the canyon 
to the best fit. 

We begin by deriving two common algorithms, the 
modified Gauss-Newton method and the Levenberg- 
Marquardt algorithm from the geometric picture in sec- 
tions EmS] and EnLSl We then suggest how it may be 
improved by applying what we call delayed gratification 
and an acceleration term in sections IVIII CI and IVIII Dl 

We demonstrate that the suggested modifications can 
offer improvements to the algorithm by applying them to 
a few test problems in section rVIH El In comparing the 
effectiveness of algorithms we make an important obser- 
vation, that the majority of the computer time for most 
problems with many parameters is occupied by Jacobian 
evaluations. As the number of parameters grows, this 
becomes increasingly the case. Models with many pa- 
rameters are more likely to be sloppy, so this assumption 
does not greatly reduce the applicability of the algorithms 
discussed. 

If an algorithm estimates the Jacobian from finite dif- 
ferences of the residuals, then most of the function (resid- 
ual) evaluations will be spent estimating the Jacobian. 
(Our function evaluation counts in Table |TT] do not in- 
clude function evaluations used to estimate Jacobians.) 
If this is the case, then for any given problem, comparing 
function evaluations automatically integrates the relative 
expense of calculating residuals and Jacobians. How- 
ever, many of the problems we use for comparison are 
designed to have only a few parameters for quick eval- 
uation, while capturing the essence of larger problems. 
We then extrapolate results from small problems to sim- 
ilar, but larger problems. Our primary objective is to 
reduce the number of Jacobian evaluations necessary for 
an algorithm to converge. We do not ignore the num- 
ber of function evaluations, but we but consider reducing 
the number of function calls to be a lower priority. As 



A. Modified Gauss-Newton Method 

The result presented in this paper that appears to be 
the most likely to lead to a useful algorithm is that cost 
contours are nearly perfect circles in extended geodesic 
coordinates as described in section [VTl The coordinates 
illustrated in Fig. I 111 transformed a long, narrow, curved 
valley into concentric circles. Searching for the best fit in 
these coordinates would be a straightforward task! This 
suggests that an algorithm that begins at an unoptimized 
point need only follow a geodesic to the best fit. We 
have thus transformed an optimization problem into a 
differential equation integration problem. 

The initial direction of the geodesic tangent vector (ve- 
locity vector) should be the Gauss-Newton direction 



^-(t = 0) = -g^d u C. 
dr 



(38) 



If we assume that the manifold is extrinsically flat (the 
necessary and sufficient condition to produce concentric 
circles in extended geodesic coordinates), then Eq. (p?o) 
tells us that the cost will be purely quadratic, 

d 2 C _ ^ v d9»dO v 

dr 2 ^ dr dr 
which implies that the first derivative of the cost will be 
linear in r: 



constant. 



(39) 
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dr dr 



+ C{t = 0). 



(40) 



A knowledge of C(t = 0) will then tell us how far the 
geodesic needs to be integrated: 



Cjr = 0) 

fjtv d8K- dB" 
y dr dr 



(41) 



We can calculate the missing piece of Eq. (f4"Tj) from the 
chain rule and Eq. ((38]), 

de^ 



c 



dr d » C 
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which gives us 



= 1. 



The simplest method one could apply to solve the 
geodesic equation would be to apply a single Euler step, 
which moves the initial parameter guess by 



-g^d v C, 



(42) 



since St = 1. Iteratively updating the parameters accord- 
ing to Eq. (|4"2"|) is known as the Gauss-Newton method. 
It can be derived without geometric considerations by 
simply assuming a linear approximation to the residu- 
als. Unless the initial guess is very good, however, the 
appearance of the inverse Hessian in Eq. (|4"2")l (with its 
enormous eigenvalues along sloppy directions) will result 
in large, unreliable steps and prevent the algorithm from 
converging. 

The Gauss-Newton method needs some way to shorten 
its steps. Motivated by the idea of integrating a dif- 
ferential equation, one could imagine taking several Eu- 
ler steps instead of one. If one chooses a time step 
to minimize the cost along the line given by the local 
Gauss-Newton direction, then the algorithm is known as 
the modified Gauss-Newton method, which is a much 
more stable algorithm than the simple Gauss-Newton 
method [53j . 

One could also imagine performing some more sophis- 
ticated method, such as a Runge-Kutta method. The 
problem with these approaches is that the sloppy eigen- 
values of the inverse metric require the Euler or Runge- 
Kutta steps to be far too small be competitive with other 
algorithms. In practice, these techniques are not as ef- 
fective as the Levenberg-Marquardt algorithm, discussed 
in the next section. 



B. Levenberg-Marquardt Algorithm 

The algorithm that steps according to Eq. (|4"2l using 
the metric of the model graph, Eq. (JTTJ) , is known as the 
Levenberg-Marquardt step: 



for the parameter values. In addition we observe that 
the resulting algorithm is more prone to parameter evap- 
oration. The purpose for adding D to the metric is to 
introduce parameter dependence to the step direction. 

The Levenberg-Marquardt algorithm adjusts A at each 
step. Typically, when the algorithm has just begun, the 
Levenberg-Marquardt term will be very large, which will 
force the algorithm to take small steps in the gradient di- 
rection. Later, once the algorithm has descended into a 
canyon, A will be lowered, allowing the algorithm to step 
in the Gauss-Newton direction and follow the length of 
the canyon. The Levenberg-Marquardt parameter, there- 
fore, serves the dual function of rotating the step direc- 
tion from the Gauss-Newton direction to the gradient 
direction, as well as shortening the step. 

As we mentioned in section IIV1 when using the Lev- 
enberg metric, A will essentially wash out all the sloppy 
eigenvalues of the original metric and leave the large ones 
unaffected. The relatively large multiplicative factor sep- 
arating eigenvalues means that A does not need to be 
finely tuned in order to achieve convergence. Neverthe- 
less, an efficient method for choosing A is the primary 
way that the Levenberg-Marquardt algorithm can be op- 
timized. We discuss two common updating schemes here. 

A typical method of choosing A at each step is de- 
scribed in Numerical Recipes |34j . One picks an initial 
value, say A = .001, and tries the proposed step. If the 
step moves to a point of larger cost, by default, the step 
is rejected and A is increased by some factor, 10. If the 
step has decreased the cost, the step is accepted and A 
is decreased by a factor of 10. This method is guaran- 
teed to eventually produce an acceptable step, since for 
extremely large values of A, the method will take an ar- 
bitrarily small step in the gradient direction. We refer to 
this as the traditional scheme for updating A. 

A more complicated method of choosing A is based on 
a trust region approach and is described in (45j. As in 
the previous updating scheme, at each step A is increased 
until the step goes downhill (all uphill steps are rejected). 
However, after an accepted step, the algorithm compares 
the decrease in cost at the new position with the decrease 
predicted by the linear approximation of the residuals 

\\r(0 old )\\-\\f(9 new )\\ 



\DY v d v C. 



\\r{9oid)\\ - r{9 old ) + J m S0m 



If D is chosen to be the identity, then the algorithm is 
the Levenberg algorithm pjjj . The Levenberg algorithm 
is simply the Gauss-Newton method on the model graph 
instead of the model manifold. 

If D is chosen to be a diagonal matrix with entries 
equal to the diagonal elements of g°, then the algo- 
rithm is the Levenberg-Marquardt algorithm (44[. As 
we mentioned in section IIV1 the Levenberg-Marquardt 
algorithm, using the Marquardt metric, is invariant to 
rescaling the parameters. We find this property to often 
be counterproductive to the optimization process since 
it prevents the modeler from imposing the proper scale 



If this value is very far from unity, then the algorithm has 
stepped beyond the region for which it trusts the linear 
approximation and will increase A by some factor even 
though the cost has decreased; otherwise, A is decreased. 
This method tunes A so that most steps are accepted, 
reducing the number of extra function evaluations. As 
a result, it often needs a few more steps, and therefore, 
a few more Jacobian evaluations. This algorithm works 
well for small problems where the computational com- 
plexity of the function and the Jacobian are comparable. 
It is not as competitive using the number of Jacobian 
evaluations as a measure of success. 
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These are certainly not the only update schemes avail- 
able. Both of these criteria reject any move that increases 
the cost, which is a natural method to ensure that the 
algorithm does not drift to large costs and never con- 
verges. One could imagine devising an update scheme 
that allows some uphill steps in a controlled way such 
that the algorithm remains well-behaved. We consider 
such a scheme elsewhere [54| and note that it was a key 
inspiration for the Delayed Gratification update scheme 
that we describe below in section IVIH CI 

As we observed in section [V] the metric formed by 
the model graph acts similarly to the effect of adding 
linear Bayesian priors as residuals. The Levenberg- 
Marquardt algorithm therefore chooses a Gauss-Newton 
step as though there were such a prior, but then ig- 
nores the prior in calculating the cost at the new point. 
A similar algorithm, known as the iteratively updated 
Gauss-Newton algorithm, includes the contribution from 
the prior when calculating the new cost, although the 
strength of the prior may be updated at each step [55| . 

C. Delayed Gratification 

We have seen that parameter-effects curvatures are 
typically several orders of magnitude larger than ex- 
trinsic curvatures for sloppy models, which means that 
the model manifold is much more flat than the non- 
linearities alone suggest and produce the concentric cir- 
cles in Fig. [TlJ When considering only a single step on 
even a highly curved manifold, if the parameter-effects 
curvature dominates, the step size will be less than the 
(inverse) extrinsic curvature and approximating the man- 
ifold by a flat surface is a good approximation. Fur- 
thermore, we have seen that when the manifold is flat, 
geodesies are the paths that we hope to follow. 

The Rosenbrock function is a well known test func- 
tion for which the extended geodesic coordinates can be 
expressed analytically. It has a long, parabolic shaped 
canyon and is given by 

n = 

r 2 = A (0 2 — 0\) , 

where A is a parameter that controls the narrowness of 
the canyon. The Rosenbrock function has a single min- 
imum at (0i, 82) = (ljl)' Since there are two residuals 
and two parameters, the model manifold is flat and the 
extended geodesic coordinates are the residuals. It is 
straightforward to solve 

X = 1-ri 

O2 = ^ + (l-r 1 ) 2 . 

If we change to polar coordinates, 

?'i = p sin <j) 
r 2 = ,ocos0, 
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FIG. 25: Extended Geodesic Coordinates for Rosen- 
brock Function. The residuals are one choice of extended 
geodesic coordinates if the number of parameters equal the 
number of data points, as is the case for the Rosenbrock func- 
tion. Because the Rosenbrock function is a simple quadratic, 
the coordinate transformation can be expressed analytically. 
Lines of constant p are equi-cost lines, while lines of constant 
<j> are the paths a geodesic algorithm should follow to the 
best fit. Because the geodesies follow the path of the narrow 
canyon, the radial geodesies are nearly parallel to the equi- 
cost lines in parameter space. This effect is actually much 
more extreme than it appears in this figure because of the 
relative scales of the two axes. 



then lines of constant 4> are the geodesic paths that we 
would like an algorithm to follow toward the best fit, and 
lines of constant p are cost contours. We plot both sets 
of curves in Fig. [551 

Inspecting the geodesic paths that lead to the best fit 
in Fig. [23 reveals that most of the path is spent following 
the canyon while decreasing the cost only slightly. This 
behavior is common to all geodesies in canyons such as 
this. We would like to devise an update scheme for A in 
the Levenberg-Marquardt algorithm that will imitate this 
behavior. The results of section IVH Fl suggest that we 
will often be able to step further than a trust region would 
allow, so we start from the traditional update scheme. 

The primary feature of the geodesic path that we wish 
to imitate is that radial geodesies are nearly parallel to 
cost contours. In the usual update scheme, if a proposed 
step moves uphill, then A is increased. In the spirit of 
following a cost contour, one could slowly increase the 
Levenberg-Marquardt parameter just until the cost no 
longer increases. If A is fine-tuned until the cost is the 
same, we call this the equi-cost update scheme. Such 
a scheme would naturally require many function evalua- 
tions for each step, but as we said before, we are primarily 
interested in problems for which function calls are cheap 
compared to Jacobian evaluations. Even so, determin- 
ing A to this precision is usually overkill, and the desired 
effect can be had by a much simpler method. 
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FIG. 26: Greedy Step and Delayed Gratification Step 
Criterion. In optimization problems for which there is a long 
narrow canyon, such as for the Rosenbrock function, choosing 
a delayed gratification step is important to optimize conver- 
gence. As the damping term is increased, the Gauss-Newton 
direction is rotated into the gradient direction, giving a num- 
ber of possible steps an algorithm might take. Choosing the 
step that lowers the cost the most will cause an algorithm to 
descend quickly into the canyon, greatly reducing the size of 
the future steps could take. This step choice is excessively 
greedy, and can be improved upon. An algorithm that takes 
the largest tolerable step size (in this case the largest step that 
does not move uphill), will not decrease the cost significantly 
in the first few steps, but will arrive at the best fit in fewer 
steps and more closely approximate the true geodesic path. 
What constitutes the largest tolerable step size should be op- 
timized for specific problems so as to guarantee convergence. 

Instead of precisely tuning A, we modify the tradi- 
tional scheme to raise and lower the parameter by differ- 
ent amounts. Increasing A by very small amounts when 
a proposed step is uphill and then decreasing it by a 
large amount when a downhill step is finally found will 
mimic the desired behavior. We have found that increas- 
ing by a factor of 2 and decreasing by a factor of 10 works 
well, consistent with Lampton's results [56j]. We call this 
method, the delayed gratification update scheme. 

The reason that this update scheme is effective is due 
to the restriction that we do not allow uphill steps. If 
we move downhill as much as possible in the first few 
steps, we greatly restrict the steps that will be allowed 
as successive iterations, slowing down the convergence 
rate, as illustrated in Fig. I2U1 

By using the delayed gratification update scheme, we 
are using the smallest value of A that does not produce 
an uphill step. If we choose a trust-region method, in- 
stead, each step will choose a much larger value of A. The 
problem with using larger values of A at each step, is that 
they drive the algorithm downhill prematurely. Even if 
the trust region only cuts each possible step in half com- 



pared to the delayed gratification scheme, the cumulative 
effect will be much more damaging because of how this 
strategy reduces the possibility of future steps. 

D. Geodesic Acceleration 

We have seen that a geodesic is a natural path that an 
algorithm should follow in its search for the best fit. The 
application of geodesies to optimization algorithms is not 
new. It has been applied, for example to the problem 
of nonlinear programming with constraints (57l . |58| . to 
neural network training [59( , and to the general problem 
of optimization on manifolds (33l . |60T |. Here we apply it 
as a second order correction to the Levenberg-Marquardt 
step. 

The geodesic equation is a second order differential 
equation, whose solution we have attempted to mimic 
by only calculating first derivatives of the residuals (Ja- 
cobians) and following a delayed gratification stepping 
scheme. From a single residual and Jacobian evaluation, 
an algorithm can calculate the gradient of the cost as well 
as the metric, which determines a direction. We would 
like to add a second order correction to the step, but one 
would expect its evaluation to require a knowledge of the 
second derivative matrix, which would be even more ex- 
pensive to calculate than the Jacobian. We have already 
noted that most of the computer time is spent on Jaco- 
bian evaluations, so second order steps would have even 
more overhead. Fortunately, the second order correction 
to the geodesic path can be calculated relatively cheaply 
in comparison to a Jacobian evaluation. 

The second order correction, or acceleration, to the 
geodesic path is given by 

a" = -r>V, (43) 

as one can see by inspecting Eq. (|2~4"1) . In the expression 
for the acceleration, the velocity contracts with the two 
lower indices of the connection. Recall from the defini- 
tion, 

that the lowered indices correspond to the second deriva- 
tives of the residuals. This means that the acceleration 
only requires a directional second derivative in the direc- 
tion of the velocity. This directional derivative can be 
estimated with two residual evaluations in addition to 
the Jacobian evaluation. Since each step will always call 
at least one residual evaluation, we can estimate the ac- 
celeration with only one additional residuals call, which 
is very cheap computationally compared to a Jacobian 
evaluation. 

With an easily evaluated approximation for the accel- 
eration, we can then consider the trajectory given by 

SO" = 9»6t+ \^5t 2 . (44) 
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FIG. 27: Geodesic Acceleration in the Rosenbrock Val- 
ley. The Gauss-Newton direction, or velocity vector, gives the 
correct direction that one should move to approach the best 
fit while navigating a canyon. However, that direction quickly 
rotates, requiring an algorithm to take very small steps in or- 
der to avoid uphill moves. The geodesic acceleration indicates 
the direction in which the velocity rotates. The geodesic accel- 
eration determines a parabolic trajectory that can efficiently 
navigate the valley without running up the wall. The linear 
trajectory quickly runs up the side of the canyon wall. 



By following the winding canyon with a parabolic path 
instead of a linear path, we expect to require fewer steps 
to arrive at the best fit. The parabola can more naturally 
curve around the corners of the canyon than the straight 
line path. This is illustrated for the Rosenbrock function 
in Fig. [27] Because the canyon of the Rosenbrock func- 
tion is parabolic, it can be traversed exactly to the best 
fit by the acceleration in a single step. 

The relationship between the velocity and the acceler- 
ation depicted in Fig. [27] for the Rosenbrock function is 
overly idealized. In general the velocity and the acceler- 
ation will not be perpendicular; in fact, it is much more 
common for them to be nearly parallel or anti-parallel. 
Notice that the expression for the connection coefficient 
involves a factor of the inverse metric, which will tend 
to bias the acceleration to align parallel to the sloppy di- 
rections, just as it does for the velocity. It is much more 
common for the acceleration to point in the direction 
opposite to the velocity, as for a summing exponentials 
model in Fig. [28b . 

Although an acceleration that is anti-parallel to the ve- 
locity may seem worthless, it is actually telling us some- 
thing useful: our proposed step was too large. As we reg- 
ulate the velocity by increasing the Levenberg-Marquardt 
parameter, we also regulate the acceleration. Once our 
velocity term is comparable to the distance over which 
the canyon begins to curve, the acceleration indicates into 
which direction the canyon is curving, as in Fig. 



If the damping term is too small, the acceleration 
points in the opposite direction to and is much larger 
than the velocity. This scenario is dangerous because it 
may cause the algorithm to move in precisely the op- 
posite direction to the Gauss-Newton direction, causing 
parameter evaporation. To fix this problem, we add an- 
other criterion for an acceptable step. We want the con- 
tribution from the acceleration to be smaller than the 
contribution from the velocity; therefore, we typically re- 
ject proposed steps, increasing the Leveberg-Marquardt 
parameter until 



< a, 



(45) 



where a is a chosen parameter, typically unity, although 
for some problems a smaller value is required. 

The acceleration is likely to be most useful when the 
canyon is very narrow. As the canyon narrows, the al- 
lowed steps become smaller. In essence, the narrowness of 
the canyon is determining to what accuracy we are solv- 
ing the geodesic equation. If the canyon requires a very 
high accuracy, then a second order algorithm is likely to 
converge much more quickly than a first order algorithm. 
We will see this explicitly in the next section when we 
compare algorithms. 

We have argued repeatedly that for sloppy mod- 
els whose parameter-effects curvature are dominant, a 
geodesic is the path that an algorithm should follow. One 
could object to this assertion on the grounds that, apart 
from choosing the initial direction of the geodesic to be 
the Gauss-Newton direction, there is no reference to the 
cost gradient in the geodesic equation. If a manifold is 
curved, then the geodesic will not lead directly to the 
best fit. In particular, the acceleration is independent of 
the data. 

Instead of a geodesic, one could argue that the path 
that one should follow is given by the first order differ- 
ential equation 

w" = 9 v - , (46) 

where we have introduced the denominator to preserve 
the norm of the tangent vector. Each Levenberg- 
Marquardt step chooses a direction in the Gauss-Newton 
direction on the model graph, which seems to be bet- 
ter described by Eq. (|46p than by the geodesic equation, 
Eq. (|2"4"]) . In fact Eq. (|4"6"]l has been proposed as a Neural 
Network training algorithm by Amari et al. [42| . 

The second order differential equation corresponding 
to Eq. ([46]) which can be found by taking the second 
derivative of the parameters, is a very complicated ex- 
pression. However, if one then applies the approximation 
that all non-linearities are parameter-effects curvature, 
the resulting differential equation is exactly the geodesic 
equation. By comparing step sizes with inverse curva- 
tures in Fig. [53] we can see that over a distance of sev- 
eral steps, the approximation that all non-linearities are 
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FIG. 28: a) De-acceleration when overstepping. Typically the velocity vector greatly overestimates the proper step size. 
(We have rescaled both velocity and acceleration to fit in the figure.) Algebraically, this is due to the factor of the inverse 
metric in the velocity, which has very large eigenvalues. The acceleration compensates for this by pointing anti-parallel to 
the velocity. However, the acceleration vector is also very large, as it is multiplied twice by the velocity vector and once by 
the inverse metric. To make effective use of the acceleration, it is necessary to regularize the metric by including a damping 
term, b) Acceleration indicating the direction of the canyon. As the Levenberg-Marquardt parameter is raised, the 
velocity vector shortens and rotates from the natural gradient into the downhill direction. The acceleration vector also shortens, 
although much more rapidly, and also rotates. In this two dimensional cross section, although the two velocity vectors rotate 
in opposite directions, the accelerations both rotate to indicate the direction that the canyon is turning. By considering the 
path that one would optimally like to take (along the canyon), it is clear that the acceleration vector is properly indicating the 
correction to the desired trajectory. 



parameter-effects curvature should be very good. In such 
a case, the deviation of Eq. (|4"o]) from Eq. will not 
be significant over a few steps. 

While the tensor analysis behind this result is long and 
tedious, the geometric meaning is simple and intuitive: 
if steps are much smaller than the extrinsic curvature on 
the surface, then the vector (in data space) corresponding 
to the Gauss-Newton direction can parallel transport it- 
self to find the Gauss-Newton direction at the next point. 
That is to say the direction of the tangent vector of a 
geodesic does not change if the manifold is extrinsically 
flat. 

Including second derivative information in an algo- 
rithm is not new. Newton's method, for example replaces 
the approximate Hessian of the Gauss- Newton method in 
Eq. ([5]), with the full Hessian in Eq. Many standard 
algorithms seek to efficiently find the actual Hessian, ei- 
ther by calculating it directly or by estimation [34], |6lJ . 
One such algorithm, which we use for comparison in 
the next section, is a quasi-Newton method of Broyden, 
Fletcher, Goldfarb, and Shannon (BFGS) [H, which es- 
timates the second derivative from an accumulation of 
Jacobian evaluations at each step. 

In contrast to these Newton-like algorithms, the 
geodesic acceleration is not an attempt to better approx- 
imate the Hessian. The results of section [VTI suggest that 
the approximate Hessian is very good. Instead of cor- 
recting the error in the size and direction of the ellipses 



around the best fit, it is more productive to account for 
how they are bent by non-linearities, which is the role of 
the geodesic acceleration. The geodesic acceleration is a 
cubic correction to the Levenberg-Marquardt step. 

There are certainly problems for which a quasi-Newton 
algorithm will make important corrections to the approx- 
imate Hessian. However, we have argued that sloppy 
models represent a large class of problems for which the 
Newton correction is negligible compared to that of the 
geodesic acceleration. We demonstrate this numerically 
with several examples in the next section. 



E. Algorithm Comparisons 

To demonstrate the effectiveness of an algorithm that 
uses delayed gratification and the geodesic acceleration, 
we apply it to a few test problems that highlight the 
typical difficulties associated with fitting by least squares. 

First, consider a generalized Rosenbrock function, 

c- (*,-£)"), 

where A and n are not optimizable parameters but set 
to control the difficulty of the problem. This problem 
has a global minimum of zero cost at the origin, with a 
canyon following the polynomial path 0™/ n whose width 
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geodesic acceleration not only improves the speed of con- 
vergence, but improves the likelihood of convergence, 
that is, the algorithm is less likely to evaporate param- 
eters. This is a consequence of the modified acceptance 
criterion in Eq. (|45|) . As an algorithm evaporates param- 
eters, it approaches a singular point of the metric on the 
model manifold, causing the velocity vector in param- 
eter space to diverge. The acceleration, however, also 
diverges, but much more rapidly than the velocity. By 
requiring the acceleration term to be smaller than the 
velocity, the algorithm is much more adept at avoiding 
boundaries. Geodesic acceleration, therefore, helps to 
improve both the initial search for the canyon from the 
plateau, as well as the subsequent race along the canyon 
to the best fit. 

Finally, we emphasize that the purpose of this sec- 
tion was to demonstrate that delayed gratification and 
geodesic acceleration are potentially helpful modifica- 
tions to existing algorithms. The results presented in 
this section do not constitute a rigorous comparison, as 
such a study would require a much broader sampling of 
test problems. Instead, we have argued that ideas from 
differential geometry can be helpful to speed up the fit- 
ting process if existing algorithms are sluggish. We are in 
the process of performing a more extensive comparison 
whose results will appear shortly [54j|. 



IX. CONCLUSIONS 



FIG. 29: Generalized Rosenbrock results for 
Levenberg-Marquardt variants. If the canyon that 
an algorithm must follow is very narrow (measured by the 
condition number of the metric at the best fit) or turns 
sharply, the algorithm will require more steps to arrive 
at the best fit. Those that use the geodesic acceleration 
term converge more quickly as the canyon narrows. As the 
parameter-effects curvature increases, the canyon becomes 
more curved and the problem is more difficult. Notice that 
changing the canyon's path from a cubic function in a) to a 
quartic function in b) slowed the convergence rate by a factor 
of 5. We have omitted the quadratic path since including the 
acceleration allows the algorithm to find the best fit in one 
step, regardless of how narrow the canyon becomes. 



is determined by A. To compare algorithms we draw 
initial points from a Gaussian distribution centered at 
(1, 1/n) with standard deviation of unity, and compare 
the average number of Jacobian evaluations an algorithm 
requires in order to decrease the cost to 10~ 4 . The results 
for the cubic and quartic versions of the problem are 
given in Fig. [29] for several version of the the Levenberg- 
Marquardt algorithm. 

We next consider a summing exponential problem; a 
summary of these results can be found in (22). Here we 
expand it to include the delayed gratification algorithm 
outlined above in section IVIH Gl 

A surprising result from Table [XT] is that including the 



A goal of this paper has been to use a geometric per- 
spective to study nonlinear least squares models, deriving 
the relevant metric, connection, and measures of curva- 
ture, and to show that geometry provides useful insights 
into the difficulties associated with optimization. 

We have presented the model manifold and noted 
that it typically has boundaries, which explain the phe- 
nomenon of parameter evaporation in the optimization 
process. As algorithms run into the manifold's bound- 
aries, parameters are pushed to infinite or otherwise 
unphysical values. For sloppy models, the manifold is 
bounded by a hierarchy of progressively narrow bound- 
aries, corresponding to the less responsive direction of pa- 
rameter space. The model behavior spans a hyper-ribbon 
in data space. This phenomenon of geometric sloppiness 
is one of the key reasons that sloppy models are difficult 
to optimize. We provide a theoretical caricature of the 
model manifold characterizing their geometric series of 
widths, extrinsic curvatures, and parameter-effects cur- 
vatures. Using this caricature, we estimate the number 
of evaporated parameters one might expect to find at the 
best fit for a given uncertainty in the data. 

The model graph removes the boundaries and helps 
to keep the parameters at reasonable levels. This is not 
always sufficient, however, and we suggest that in many 
cases, the addition of thoughtful priors to the cost func- 
tion can be a significant help to algorithms. 

The second difficulty in optimizing sloppy models 
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Algorithm 


Success Rate 


Mean NJEV 


Mean NFEV 


Trust Region LM 


12% 


1517 


1649 


Traditional LM 


33% 


2002 


4003 


Traditional LM + accel 


65% 


258 


1494 


Delayed Gratification 


26% 


1998 


8625 


Delayed Gratification + accel 


65% 


163 


1913 


BFGS 


8% 


5363 


5365 



TABLE II: The results of several algorithms applied to a test problem of fitting a sum of four exponential terms (varying both 
rates and amplitudes - 8 parameters) in log-parameters (to enforce positivity). Initial conditions are chosen near a manifold 
boundary with a best fit of zero cost near the center of the manifold. Among successful attempts, we further compare the 
average number of Jacobian and function evaluations needed to arrive at the fit. Success rate indicates an algorithm's ability 
to avoid the manifold boundaries (find the canyon from the plateau), while the number of Jacobian and function evaluations 
indicate how efficiently it can follow the canyon to the best fit. BFGS is a quasi newton scalar minimizer of Broyden, Fletcher, 
Goldfarb, and Shanno (BFGS) [621. |63|. The traditional [H, [44j and trust region [45l | implementations of Levenberg-Marquardt 
consistently outperform this and other general optimization routines on least squares problems, such as Powell, simplex, and 
conjugate gradient. Including the geodesic acceleration on a standard variant of Levenberg-Marquardt dramatically increases 
the success rate while decreasing the computation time. 



is that the model parameters are far removed from 
the model behavior. Because most sloppy models are 
dominated by parameter-effects curvature, if one could 
reparametrize the model with extended geodesic coordi- 
nates, the long narrow canyons would be transformed to 
one isotropic quadratic basin. Optimizing a problem in 
extended geodesic coordinates would be a trivial task! 

Inspired by the motion of geodesies in the curved 
valleys, we developed the delayed gratification update 
scheme for the traditional Levenberg-Marquardt algo- 
rithm and further suggest the addition of a geodesic ac- 
celeration term. We have seen that when algorithms must 
follow long narrow canyons, these can give significant im- 
provement to the optimization algorithm. We believe 
that the relative cheap computational cost of adding the 
geodesic acceleration to the Levenberg-Marquardt step 
gives it the potential to be a robust, general-purpose op- 
timization algorithm, particularly for high dimensional 
problems. It is necessary to explore the behavior of 
geodesic acceleration on a larger problem set to justify 
this conjecture [54| . 
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Appendix A: Information Geometry 

The Fisher information matrix, or simply Fisher infor- 
mation, /, is a measure of the information contained in a 
probability distribution, p. Let £ be the random variable 
whose distribution is described by p, and further assume 



that p depends on other parameters 9 that are not ran- 
dom. This leads us to write 

p = p(Z;0), 

with the log likelihood function denoted by I: 
I = log p. 

The information matrix is defined to be the expectation 
value of the second derivatives of Z, 

I r = <-^) = -f#rt>°)w&- ^ 

It can be shown that the Fisher information can be writ- 
ten entirely in terms of first derivatives: 

, dl dl , r „ tJ . ^ dl dl 

^(mw^jv^mw- (48) 

Eq. (|48p makes it clear that the Fisher information is 
a symmetric, positive definite matrix which transforms 
like a covariant rank-2 tensor. This means that it has all 
the properties of a metric in differential geometry. Infor- 
mation geometry considers the manifolds whose metric is 
the Fisher information matrix corresponding to various 
probability distributions. Under such an interpretation, 
the Fisher information matrix is known as the Fisher in- 
formation metric. 

As we saw in Section U least squares problems arise by 
assuming a Gaussian distribution for the deviations from 
the model. Under this assumption, the cost function is 
the negative of the log likelihood (ignoring an irrelevant 
constant). Using these facts, it is straightforward to ap- 
ply Eq. (|4"T1) or Eq. pg| to calculate the information met- 
ric for least squares problems. From Eq. (|47l) . we get 

d 2 C 

= ( qqvqqJ = ^{d»r m d v r m + r m <9 M cV m ), (49) 

m 
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where we have replaced / by g to indicate that we are 
now interpreting it as a metric. 

Eq. (j49| . being an expectation value, is really an 
integral over the random variable (i.e. the residuals) 
weighted by the probability. However, since the integral 
is Gaussian, it can be evaluated easily using Wick's theo- 
rem (remembering that the residuals have unit variance). 
The only subtlety is how to handle the derivatives of the 
residuals. Inspecting Eq. ([T]), reveals that the derivatives 
of the residuals have no random element, and can there- 
fore be treated as constant. The net result is 

9^ = ^2 9 n r m 9 v r m = {J T J)nv, (50) 

m 

since (r m ) = 0. Note that we have used the Jacobian 
matrix, J OTjU = 9^r m in the final expression. 

We arrive at the same result using Eq. (|48|) albeit using 
different properties of the distribution: 

Now we note that the residuals are independently dis- 
tributed, (r m r„) — S mril which immediately gives 
Eq. (T5D|) . the same metric found in Section fl] 

There is a class of connections consistent with the 
Fisher metric, known as the a-connections because they 
are parametrized by a real number, a [12J. They are 
given by the formula 

r$ e = (djdpdj + f^^) deldpldj). 

This expression is straightforward to evaluate. The result 
is independent of a, 

T % = 9 tK ^2 d ^ r mdp,d v r m . 
in 

It has been shown elsewhere that the connection corre- 
sponding to a — is in fact the Riemann connection. 
It is interesting to note that all the a-connections, for 
the case of the nonlinear least squares problem, are the 
Riemann connection. 

The field of information geometry is summarized nicely 
in several books [l2l [l3|. 

Appendix B: Algorithms 

Since we are optimizing functions with the form of 
sums of squares, we are primarily interested in algo- 
rithms that specialize in this form, specifically variants 
of the Levenberg-Marquardt algorithm. The standard 
implementation of the Levenberg-Marquardt algorithm 
involves a trust region formulation. A FORTRAN imple- 
mentation, which we use, is provided by MINPACK [64| . 

The traditional formulation of Levenberg-Marquardt, 
however, does not employ a trust region, but adjusts the 



Algorithm 1 Traditional Levenb erg- Marquardt as 
described in (H, |H, E|| 

1. Initialize values for the parameters, x, the 
Levenberg-Marquardt parameter A, as well as X up and Xdown 
to be used to adjust the damping term. Evaluate the 
residuals r and the Jacobian J at the initial parameter guess. 

2. Calculate the metric, g = J T J + XI and the cost gradient 
VC = J T r,C =\r 2 . 

3. Evaluate the new residuals, r new at the point given by 
Xnew = x — g~ 1 \7C , and calculate the cost at the new point, 
r - lr 2 

^new — 2 new ■ 

4. If Cnew < C, accept the step, x — x new and set r — r nem 
and A = X/ Xdown- Otherwise, reject the step, keep the old 
parameter guess x and the old residuals r, and adjust 

A = A x X U p. 

5. Check for convergence. If the method has converged, 
return x as the best fit parameters. If the method has not 
yet converged but the step was accepted, evaluate the 
Jacobian J at the new parameter values. Go to step 2. 



Levenberg-Marquardt term based on whether the cost 
has increased or decreased after a given step. An im- 
plementation of this algorithm is described in Numerical 
Recipes and summarized in Algorithm [TJ Typical 
values of X up and Xdown are 10- We use this formulation 
as the basis for our modifications. 

The delayed gratification version of Levenberg- 
Marquardt that we describe in section I VIII CI modifies 
the traditional Levenberg-Marquardt algorithm to raise 
and lower the Levenberg-Marquardt term by differing 
amounts. The goal is to accept a step with the smallest 
value of the damping term that will produce a downhill 
step. This can typically be accomplished by choosing 

X U p — 2 and Xdown = 10. 

The geodesic acceleration algorithm can be added to 
any variant of Levenberg-Marquardt. We explicitly add 
it to the traditional version and the delayed gratifica- 
tion version, as described in Algorithm [21 We do this by 
calculating the geodesic acceleration on the model graph 
at each iteration. If the step raises the cost or if the 
acceleration is larger than the velocity, then we reduce 
the Levenberg-Marquardt term and reject the step by 
default. If the step moves downhill and the velocity is 
larger than the acceleration, then we accept the step. 
For accepted steps we raise the Levenberg-Marquardt 
term; otherwise, we decrease the Levenberg-Marquardt 
term. In our experience the algorithm described in Algo- 
rithm [2] is robust enough for most applications; however, 
we do not consider it to be a polished algorithm. We will 
present elsewhere an algorithm utilizing geodesic accel- 
eration that is further optimized and that we will make 
available as a FORTRAN routine [Bl |. 

In addition to the variations of the Levenberg- 
Marquardt algorithm, we also compare algorithms for 
minimization of arbitrary functions not necessarily of the 
least squares form. We take several such algorithms from 
the Scipy optimization package [63|. These fall into two 
categories, those that make use of gradient information 
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Algorithm 2 Geodesic Acceleration in the traditional 
Levenberg-Marquardt algorithm 

1. Initialize values for the parameters, x, the 
Levenberg-Marquardt parameter A, as well as X up and Xdown 
to be used to adjust the damping term, and a to control the 
acceleration/velocity ratio. Evaluate the residuals r and the 
Jacobian J at the initial parameter guess. 

2. Calculate the metric, g — J T J + XI and the Cost gradient 
VC= J T r, C= \r 2 . 

3. Calculate the velocity v — —g~ VC, the geodesic 
acceleration of the residuals in the direction of the velocity 
a = -g- 1 J T (VV'cVcU) 

4. Evaluate the new residuals, r nem at the point given by 
Xnew = x + v + |a , and calculate the cost at the new point, 
(1 - ir 2 

^new — 2 ' new ■ 

5. If Cnew < C and |a[/[«| < a, accept the step, x — x new 
and set r = r nem and A = X/ Xdown- Otherwise, reject the 
step, keep the old parameter guess x and the old residuals r, 
and adjust A = A x X up . 

6. Check for convergence. If the method has converged, 
return x as the best fit parameters. If the method has not 
yet converged but the step was accepted evaluate the 
Jacobian J at the new parameter values. Go to step 2. 



and those that do not. Algorithms utilizing gradient in- 
formation include a quasi-Newton of Broyden, Fletcher, 
Goldfarb, and Shannon (BFGS), described in [s3|. We 
also employ a limited memory variation (L-BFGS-B) de- 
scribed in [65] and a conjugate gradient (CG) method 
of Polak and Ribiere, also described in [62]. We also 
explored the downhill simplex algorithm of Nelder and 
Mead and a modification of Powells' method [63], nei- 
ther of which make use of gradient information directly, 
and were not competitive with other algorithms. 
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The term parameter evaporation was originally used to 
describe the drift of parameters to infinite values in the 
process of Monte Carlo sampling [6(|. In this case the 
tendency of parameters to run to unphysical values is a 
literal evaporation caused by the finite temperature of the 
stochastic process. We now use the term to also describe 
deterministic drifts in parameters to extreme values in 
the optimization process. 

This example is also a separable nonlinear least squares 
problem. Separable problems containing a mixture of 
linear and nonlinear parameters are amenable to the 
method known as variable projection [67i - [69l | . Variable 
projection consists of first performing a linear least 
squares optimization on the linear parameters, making 
them implicit functions of the nonlinear parameters. The 
geometric effect of this procedure is to reduce the di- 
mensionality of the model manifold, effectively selecting 
a sub-manifold which now depends upon the location of 
the data. We will not discuss this method further in this 
paper, but we note that it is likely to have interesting 
geometric properties. 

This is strictly only true if the parameter-effects curva- 
ture has no compression component. Bates and Watts 
observe that typically, the compression is a large part of 
the parameter-effects curvature [15]. As long as the com- 
pression is not significantly larger than the rotation (i.e. 
is within an order of magnitude), the parameter-effects 
curvature will be the same order of magnitude as the 
extrinsic curvature of the one-dimensional model. 
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